Natural Object Categorization

Aaron  F.  Bobick 


Abstract

This thesis addresses the problem of categorizing natural objects. To
provide a criterion for categorization we propose that the purpose of a
categorization is to support the inference of unobserved properties of
objects from the observed properties. Because no such set of categories can
be constructed in an arbitrary world, we present the Principle of Natural
Modes as a claim about the structure of the world.

We  first  define  an  evaluation  function  that  measures  how  well  a  set  of 
categories  supports  the  inference  goals  of  the  observer.  Entropy  measures  for 
property  uncertainty  and  category  uncertainty  are  combined  through  a  free 
parameter  that  reflects  the  goals  of  the  observer.  Natural  categorizations 
are  shown  to  be  those  that  are  stable  with  respect  to  this  free  parameter. 
The  evaluation  function  is  tested  in  the  domain  of  leaves  and  is  found  to 
be  sensitive  to  the  structure  of  the  natural  categories  corresponding  to  the 
different  species. 

We next develop a categorization paradigm that utilizes the categorization
evaluation function in recovering natural categories. A statistical
hypothesis generation algorithm is presented that is shown to be an effective
categorization procedure. Examples drawn from several natural domains are
presented,  including  data  known  to  be  a  difficult  test  case  for  numerical  cat¬ 
egorization  techniques.  We  next  extend  the  categorization  paradigm  such 
that  multiple  levels  of  natural  categories  are  recovered;  by  means  of  recur¬ 
sively  invoking  the  categorization  procedure  both  the  genera  and  species  are 
recovered  in  a  population  of  anaerobic  bacteria. 

Finally,  a  method  is  presented  for  evaluating  the  utility  of  features  in 
recovering  natural  categories.  This  method  also  provides  a  mechanism  for 
determining  which  features  are  constrained  by  the  different  processes  present 
in a multiple modal world.

Thesis  Supervisor:  Dr.  Whitman  Richards 

Professor,  Department  of  Brain  and  Cognitive  Sciences 

Chapter 1
Introduction 


1.1  The  Problem:  Object  Categorization 

Let  us  travel  back  to  the  jungle  of  our  ancestors.  We  see  an  object  in  the 
distance,  moving  slowly  on  four  legs.  The  object  has  black  stripes  on  a  beige 
coat  of  fur,  a  large  appendage  in  front  (the  “head”)  with  sharp  serrations  in  a 
hinged  opening,  long  whisker-like  hairs  in  front,  and  a  narrow,  elongated  rear 
appendage  that  oscillates.  Suddenly,  we  notice  the  object  has  turned  and 
two  round,  black  objects,  recessed  in  the  front  appendage,  are  now  pointed 
in  our  direction.  As  it  begins  to  move  toward  us,  we  quickly  decide  that  this 
is  an  appropriate  time  to  leave,  and  with  due  haste. 

If  analyzed  only  casually,  the  above  scenario  appears  to  be  an  example  of 
simple  and  rational  behavior.  We  view  an  object  which  we  perceive  to  be  a 
tiger,  we  know  that  tigers  feast  on  people,  and  thus  we  decide  to  run  for  our 
lives.  But  let  us  examine  the  scenario  in  greater  detail.  Our  first  (perceptual) 
act is to encode some stimulus information: an object with four
downward-pointing appendages, translating across our visual field, endowed with certain
physical  characteristics.  Our  last  (behavioral)  act  is  a  decision  to  flee,  based 
upon  knowledge  of  the  potential  behavior  of  that  object.  But,  somewhere  in 
between  those  two  events,  we  make  the  critical  inference  about  unobserved 
properties  of  an  object  from  the  observed  properties.  Given  only  a  sensory 
description of an object, we are able to make inferences about unobservable
properties such as the intentions of an animal. How is such an inference
possible?

The obvious, in fact seemingly trivial, answer is that the sensory
information available is sufficient to determine that the object is a tiger; thus, our
knowledge  about  the  behavior  of  tigers  allows  us  to  predict  the  behavior  of 
the  object.  That  is,  given  the  sensory  information,  we  conclude  that  the 
object  is  a  member  of  the  “tiger”  category  and  thus  we  expect  the  object 
to  behave  in  a  manner  consistent  with  the  behavior  of  other  objects  of  the 
same  category. 

But  this  answer  to  the  question  of  how  the  observer  makes  predictions 
about  the  behavior  of  objects  is  not  adequate.  Simply  announcing  a  category 
to  which  an  object  belongs  does  not  provide  the  observer  with  the  necessary 
predictive  power.  For  example,  suppose  we  view  the  previously  described  sit¬ 
uation,  but  decide  that  the  object  in  question  belongs  to  the  category  “large 
fuzzy  thing.”  In  this  case,  our  ability  to  make  inferences  about  the  behavior 
of  the  object  is  limited,  and  our  response  might  not  be  appropriate  for  the 
situation.  The  large  fuzzy  thing  would  partake  of  an  early  supper.  Although 
the  category  asserted  is  correct,  “large  fuzzy  thing”  does  not  support  the 
inferences that are necessary for the observer to interact successfully with his
environment. 

However,  the  intuition  that  the  observer  accomplishes  his  inference  task 
by  determining  the  “correct”  category  of  an  object  is  strong.  The  only  diffi¬ 
culty  with  the  previous  example  was  that  some  categories  (like  “tiger”)  are 
more  useful  for  inference  than  others  (“large  fuzzy  thing”).  Therefore,  if  the 
observer  is  to  predict  the  important  behavior  of  objects  by  determining  the 
categories  to  which  they  belong,  then  those  categories  must  be  matched  both 
to  the  goals  of  the  observer  and  to  the  structure  of  the  world.  In  particular, 
these  categories  must  satisfy  two  requirements.  First,  using  only  sensory 
information,  the  observer  must  be  able  to  determine  the  category  to  which 
an  object  belongs.  Second,  once  the  category  of  an  object  is  established, 
membership  of  the  object  in  that  category  must  allow  the  observer  to  make 
important  inferences  about  the  behavior  of  the  object.  Which  inferences  are 
important  depends  upon  the  goals  of  the  observer. 

As  we  will  discuss  in  the  next  section,  we  have  no  a  priori  reason  to 
believe that a set of categories exists that permits the observer to both identify
the  category  of  an  object  from  sensory  information  and  predict  unobserved 
properties  as  well.  And  if  such  categories  do  exist,  how  would  the  observer 
come to know them? The goal of this thesis is to understand and provide a
solution  to  the  problem  of  discovering  the  useful  categories  in  the  world. 

We  can  decompose  the  object  categorization  problem  into  the  following 
three  questions: 

•  What  are  the  necessary  conditions  that  must  be  true  of  the  world  if 
a  set  of  categories  is  to  be  useful  to  the  observer  in  predicting  the 
important  properties  of  objects? 

•  What  are  the  characteristics  of  such  a  set  of  categories? 

•  How  does  the  observer  acquire  the  categories  that  support  the  infer¬ 
ences  required? 

These  problems  follow  one  another  directly.  By  identifying  the  structure 
in  the  world  that  must  be  present  in  order  for  the  observer  to  be  able  to 
construct  a  set  of  categories  that  supports  important  inferences,  we  are  able 
to  specify  the  characteristic  structure  that  such  a  set  of  categories  must 
exhibit.  Once  we  have  identified  these  characteristics  we  can  attempt  to 
recover  categories  that  satisfy  these  conditions. 

1.2  A  Necessary  Condition:  Natural  Modes 

We have stated that the goal of categorization is to permit the inference of
important properties of objects. Often, however, many of the important
properties of objects are not directly observable. There is no direct sensory
stimulus for “tends to eat human beings for dinner.” Thus, if the observer is
to accomplish this categorization task, then he is required to predict unobserved
properties  from  observed  properties.  How  is  this  possible?  Certainly,  one 
could  construct  a  world  in  which  the  inference  task  was  not  feasible.  If  the 
important  (unobserved)  properties  of  objects  are  independent  of  the  prop¬ 
erties  available  to  the  observer  through  his  sensory  mechanisms,  then  no 
useful  inferences  could  be  made.  No  set  of  categories  could  be  constructed 
that  would  allow  the  observer  to  predict  the  behavior  of  objects.  There¬ 
fore,  if  we  assume  that  useful  categorization  is  possible,  if  we  accept  human 
perception  as  an  existence  proof  that  the  goal  of  making  reliable  inferences 
about  the  properties  of  objects  can  be  achieved,  then  it  must  be  the  case 
that our world is structured in a special way.

To capture this structuring of the world, we propose the Principle of
Natural  Modes,  a  claim  that  the  world  does  not  consist  of  arbitrary  objects, 
but  of  objects  highly  constrained  by  the  processes  that  create  them  and  the 
environment  that  acts  upon  them.  Natural  modes  —  clusterings  of  objects 
in  properties  important  to  the  interaction  between  objects  and  their  environ¬ 
ment  —  cause  objects  to  display  large  degrees  of  redundancy;  for  example, 
most  objects  with  beaks  also  have  wings,  claws,  and  feathers.  Because  ob¬ 
jects  within  the  same  natural  mode  exhibit  the  same  behavior  in  terms  of 
their important properties, the natural modes are an appropriate set of
categories  for  the  recognition  task.  Once  the  natural  mode  of  an  object  is 
established,  important  properties  of  that  object  can  be  inferred.  Stated  suc¬ 
cinctly,  natural  modes  provide  the  basis  for  a  natural  categorization  of  the 
world. 

The  goal  of  the  observer,  then,  is  to  recover  the  natural  mode  categories  in 
the  world.  Our  task  is  to  develop  the  theoretical  tools  necessary  to  allow  the 
observer to achieve his goal. In the chapters that follow, we will
develop  more  fully  the  concept  of  natural  modes,  derive  a  measure  sensitive 
to  whether  a  set  of  categories  corresponds  to  natural  clusters,  and  generate 
a  procedure  by  which  the  observer  can  recover  the  natural  categories  from 
the  data  provided  by  the  environment. 

1.3  Thesis  Outline 

The  thesis  is  logically  divided  into  three  parts.  The  first  part  develops  the 
philosophical  groundwork  for  the  recovery  of  natural  categories.  Chapter  2 
begins  with  a  discussion  of  the  goals  of  categorization  and  how  those  goals 
require  an  appropriately  structured  world.  The  Principle  of  Natural  Modes 
is  then  developed  as  a  characterization  of  the  structure  of  the  world  and 
as  a  basis  for  categorization.  The  philosophical,  physical,  and  psychological 
implications  of  the  claim  of  natural  categories  are  explored;  in  particular 
we  reconcile  formal  logical  arguments  against  natural  categories  with  the 
physical  and  psychological  evidence  supporting  their  existence.  Chapter  3 
examines some of the previous work in the fields of cognitive science,
cluster analysis, and machine learning that is relevant to the recovery of
natural categories.

The second part of the thesis, consisting of chapter 4, addresses the
problem of measuring how well a set of categories reflects the structure of the
natural modes. We develop a measure, based on information theory, that
assesses how well a set of categories supports the goals of the observer: the
reliable inference of unobserved properties from observed properties. Because
it is the existence of natural modes that permits the observer to accomplish
this inference task, we argue that a set of categories (a categorization)
that  supports  the  goals  of  the  observer  must  reflect  the  natural  modes.  The 
behavior  of  the  measure  is  demonstrated  in  the  natural  domain  of  leaves. 
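
As a rough illustration of the kind of measure described here, the following
Python sketch combines a property-uncertainty term and a category-uncertainty
term through a free parameter. The particular form (a weighted sum of
averaged conditional entropies), the function names, and the toy data are all
assumptions made for illustration; the measure actually used is the one
developed in chapter 4.

```python
import math
from collections import Counter, defaultdict


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def conditional_entropy(pairs):
    """H(Y | X) in bits, estimated from a list of (x, y) observations."""
    by_x = defaultdict(Counter)
    for x, y in pairs:
        by_x[x][y] += 1
    n = len(pairs)
    return sum((sum(ys.values()) / n) * entropy(ys.values())
               for ys in by_x.values())


def evaluate(objects, labels, lam):
    """Hypothetical score (lower is better) for a candidate categorization.

    objects: list of tuples of discrete property values
    labels:  one category label per object
    lam:     free parameter in [0, 1] weighting the two uncertainties
    """
    n_props = len(objects[0])
    # Property uncertainty: average H(property_k | category).
    u_p = sum(conditional_entropy([(lab, obj[k])
                                   for obj, lab in zip(objects, labels)])
              for k in range(n_props)) / n_props
    # Category uncertainty: average H(category | property_k).
    u_c = sum(conditional_entropy([(obj[k], lab)
                                   for obj, lab in zip(objects, labels)])
              for k in range(n_props)) / n_props
    return (1 - lam) * u_p + lam * u_c


# Toy data (invented): two redundant property clusters.
objs = [("beak", "wings"), ("beak", "wings"),
        ("snout", "legs"), ("snout", "legs")]
candidates = {
    "natural": ["bird", "bird", "mammal", "mammal"],
    "coarse": ["thing"] * 4,
    "fine": ["a", "b", "c", "d"],
}
for lam in (0.0, 0.5, 1.0):
    print(lam, {name: evaluate(objs, labs, lam)
                for name, labs in candidates.items()})
```

In this toy example the categorization that follows the two property clusters
scores zero for every value of the parameter, while the overly coarse and
overly fine categorizations are each penalized at one end of the range.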

Finally, in chapters 5 and 6, we address the issue of recovering the
natural  modes  from  a  set  of  data.  In  chapter  5,  we  define  a  categorization 
paradigm  inspired  by  the  formal  learning  theory  work  of  Osherson,  Stob, 
and  Weinstein  [1986].  Within  the  context  of  this  paradigm,  we  develop  a  dy¬ 
namic  categorization  algorithm  which  makes  use  of  the  measure  developed 
in  chapter  4  to  evaluate  hypothesized  categorizations.  The  performance  of 
this  algorithm  is  tested  in  three  natural  domains,  including  a  set  of  data  that 
have  served  as  a  test  for  other  categorization  systems.  The  results  indicate 
that  the  categorization  algorithm  is  an  effective  method  for  recovering  nat¬ 
ural  categories.  An  analysis  of  the  competence  of  the  algorithm  is  provided 
and  predicts  the  observed  behavior. 

In  chapter  6,  we  extend  the  analysis  of  the  categorization  algorithm  into 
domains  in  which  there  are  multiple  natural  clusterings.  Such  domains  are 
formed when more than one level of process constrains the properties of objects.
For  example,  we  will  consider  the  domain  of  infectious  bacteria  where  there 
is  structure  at  both  the  genus  and  species  level.  We  develop  a  procedure 
by  which  the  observer  can  recover  both  levels  of  categories.  Furthermore, 
we  provide  a  method  by  which  the  observer  can  determine  which  properties 
of  objects  are  constrained  by  each  level  of  process.  This  same  mechanism 
enables  the  observer  to  evaluate  the  utility  of  a  property  for  performing  the 
categorization  task. 

In  the  conclusion  of  the  thesis,  chapter  7,  we  summarize  the  results  of  the 
previous  sections,  once  again  consider  the  utility  of  recovering  the  natural 
categories  in  the  world,  and  discuss  potential  extensions  to  the  work. 


Chapter  2 

Natural  Categories 


We  begin  our  study  of  natural  object  categories  by  examining  a  task  that 
explicitly makes use of such categories: object recognition. By recognition
we  simply  mean  the  act  of  announcing  some  category  when  an  object  is 
presented.  Our  first  consideration  will  be  the  goal  of  recognition,  which  we 
will  propose  to  be  the  inference  of  important  unobserved  properties  from 
observed  properties.  If  recognition  is  to  be  performed  by  announcing  the 
category  to  which  an  object  belongs,  what  kinds  of  categories  would  per¬ 
mit  the  observer  to  attain  this  goal?  Under  what  conditions  is  such  a  goal 
possible? To help achieve these goals, we will propose the Principle of
Natural Modes: a claim, about the world, that there exist sets of natural
categories  ideally  suited  to  the  task  of  making  useful  inferences.  This  claim 
will  need  to  be  reconciled  with  philosophical  and  logical  arguments  against 
the  ontological  existence  of  such  categories.  In  support  of  natural  modes 
and  their  use  for  recognition  we  will  present  evidence  from  both  the  physical 
world  and  the  psychologies  of  various  organisms.  Finally,  we  will  be  able  to 
pose  the  categorization  problem  as  the  discovery  of  natural  mode  categories 
in  the  world. 


2.1  The  Goal  of  Recognition 

Suppose  we  wish  to  construct  a  machine  (or  organism)  which  is  to  perform 
object  recognition  by  announcing  some  category  for  each  object  encountered. 
What set of categories would be appropriate? Certainly we cannot answer
this question without placing further constraint on the output of this
machine. Otherwise, any arbitrary categorization would be valid, e.g. “announce
category 1 if the object is less than 100 feet away; announce category
2 otherwise.” Therefore we need to provide an additional constraint as to
what makes a suitable or useful categorization.

Figure 2.1: A canonical observer viewing a canonical object. The O_i's and
U_j's represent observed and unobserved properties, respectively. The goal of
the observer is to infer the U_j's from the O_i's.

To  provide  such  a  constraint,  let  us  propose  that  the  object  recognition 
task  —  and  therefore  object  categorization  —  has  as  its  goal  the  following: 

Goal of Recognition: To predict important unobserved properties
from observed properties.

This  goal  requires  that  when  an  object  is  “recognized,”  which  we  have  de¬ 
fined  to  mean  when  some  category  is  announced,  it  should  be  the  case  that 
inferences about the unobserved properties of the object can be reliably
asserted.  Properties  of  particular  interest  are  those  that  affect  the  object’s 
interaction  with  the  environment,  of  which  the  observer  is  a  part. 

To illustrate the goal, consider our observer in Figure 2.1. While viewing
some object the observer measures certain observable properties O_i. The
observed properties may include very simple quantities such as dominant color
or overall length, or they may be more complex measures such as a description
of the basic geometric parts of the object [Hoffman and Richards, 1985].
From these properties, the observer wants to infer the unobserved properties
U_j. These unobserved properties may include function (“something to sit
upon”) or behavior and affordances [Gibson, 1979] (“something which moves
and will try to eat me”). This basic inference is really the basic problem of
perception,  and  we  can  use  this  goal  of  recognition  to  provide  criteria  for  an 
appropriate  set  of  categories. 

Notice,  however,  that  being  able  to  make  reliable  inferences  about  an 
object’s  properties  from  its  category  is  not  sufficient  to  satisfy  the  goal  of 
recognition.  Recognition  requires  using  one  set  of  properties  (observed)  to 
make  inferences  about  another  set  of  properties  (unobserved).  Thus,  we 
need  not  only  the  ability  to  infer  reliably  an  object’s  (unobserved)  proper¬ 
ties  from  its  category,  but  also  the  ability  to  infer  an  object’s  category  from 
its  (observed)  properties.  For  example,  the  validity  of  the  predictions  should 
degrade  gracefully  as  less  observed  information  is  provided;  it  will  often  be 
the  case  that  the  observer  only  recovers  a  subset  of  the  observable  proper¬ 
ties.  Also,  the  observer  should  be  able  to  make  predictions  about  objects 
not  previously  viewed.  That  is,  the  observer  must  be  able  to  generalize  ap¬ 
propriately  such  that  the  predictions  about  the  non-observed  properties  of  a 
novel  object  tend  to  remain  valid. 

As  an  aside,  we  should  address  the  (skeptic’s)  question  of  why  use  cate¬ 
gories  at  all  to  satisfy  the  goal  of  recognition.  If  one’s  goal  is  only  to  make 
predictions  about  unobserved  properties  from  the  information  provided  by 
observed  properties,  then  a  more  direct  strategy  would  be  to  recover  the 
relationships  between  the  two.  For  example,  one  could  estimate  all  the  con¬ 
ditional  probabilities  (of  every  order)  and  use  these  estimates  to  make  predic¬ 
tions.  One  response  to  this  argument  is  that  we  have  not  (yet)  claimed  that 
categories  are  the  best  mechanism  for  solving  the  inference  problem.  Rather, 
if  given  the  problem  of  constructing  categories  for  the  recognition  task,  then 
reliable  inference  is  one  means  of  defining  suitable  criteria.  However,  we 
actually  do  wish  to  make  the  claim  that  categories  are  an  efficient  and  ef¬ 
fective  means  of  achieving  the  goal  of  reliable  inference  about  unobserved 
properties. We must postpone the defense of this claim until we discuss the
principle of natural modes, to be presented in the next section.
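
To give a sense of scale for the direct strategy of estimating conditional
probabilities of every order, the sketch below counts the conditioning
contexts such a strategy would involve. It assumes m discrete observed
properties, each taking a fixed number of values, and treats every assignment
of values to any subset of those properties as one context; these modelling
choices and the function name are illustrative assumptions.

```python
from math import comb


def conditioning_contexts(m, values=2):
    """Count the ways to fix the values of some subset of m observed properties.

    A subset of size k can be chosen in comb(m, k) ways and given values in
    values**k ways; each result is one context C in an estimate P(U | C).
    """
    return sum(comb(m, k) * values ** k for k in range(m + 1))


for m in (5, 10, 20):
    print(m, conditioning_contexts(m))
```

By the binomial theorem this sum equals (values + 1) ** m, so under these
assumptions the number of contexts grows exponentially with the number of
observed properties.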

Given  the  goal  of  constructing  a  set  of  categories  consistent  with  the 
proposed  goal  of  recognition,  is  it  possible  for  an  observer  to  perform  such 
a  categorization  of  objects?  Will  his  categorization  permit  the  inference  of 
unobserved  properties?  The  answers  to  these  questions  clearly  depend  on 
the  domain  in  which  the  recognition  system  is  to  operate.  If  there  is  no 
correlation  between  the  sensory  data  and  the  behavior  of  an  object,  then 
no  such  inference  is  possible.  If  every  object  in  a  world  (including  witches, 
bicycles,  and  trees)  is  spherical  in  shape,  blue  in  color,  and  matte  of  surface, 
then  such  visual  attributes  would  be  useless  for  inferences  of  unobserved 
properties  important  to  the  observer.  Under  such  circumstances  a  visual 
recognition  system  which  performed  useful  classification  could  not  be  built. 
Therefore,  if  we  are  to  claim  that  the  goal  of  the  recognition  system  is  to  place 
objects  in  the  world  into  categories  that  permit  the  prediction  of  unobserved 
properties,  then  for  such  a  system  to  be  successful  it  must  be  the  case  that 
the  world  is  structured  in  such  a  way  as  to  make  these  inferences  possible. 
This  is  a  strong  claim,  and  one  which  is  fundamentally  different  from  stating 
that  the  only  structure  present  is  that  which  is  imposed  upon  the  world  by 
the  observer’s  interpretation. 

2.2  Natural  Modes 

If  we  take  the  human  vision  system  as  an  existence  proof  that  it  is  possible 
to define a categorization of objects that permits inferences about an object’s
unobserved  properties  (e.g.  I  can  visually  categorize  some  object  as  a  “horse” 
and  predict  many  of  its  unobserved  properties  based  upon  that  categoriza¬ 
tion),  then  it  must  be  the  case  that  the  natural  world  is  structured  in  some 
particular  way.  What  would  be  the  basis  of  such  structure? 

To  gain  insight  into  this  question,  consider  the  Gedanken  experiment  of 
giving a grade school art class the assignment of drawing pictures of
imaginary animals, animals the children have never seen and about which
nothing  has  been  said.  The  results  are  as  varied  as  the  children  who  produce 
them:  multiple-headed  “monsters”,  flying  elephants,  and  other  composite 
animals  are  produced.  Completely  bizarre-looking  creatures  also  emerge. 
There seems to be no limit to the number of animals that one could
imagine. Yet, they live only in the mind, and in the world of children’s toys
which produce creatures such as Bee-Lions.

If these animals could exist (i.e., we could physically construct them), why
don’t  they?  In  some  instances,  the  laws  of  biological  physics  simply  preclude 
their  feasibility.  Flying  elephants  would  require  a  weight,  surface  area,  and 
muscle  relation  that  cannot  be  created  from  the  biological  hardware  used  to 
make  an  elephant  [McMahon,  1975].  Other  animals,  although  feasible,  may 
not  exist  because  such  creatures  were  either  never  formed  by  mutation,  or,  if 
formed,  they  were  made  extinct  by  forces  in  the  environment.  In  this  latter 
case  and  in  the  case  of  impossible  animals,  we  can  view  the  situation  as  an 
entity  (the  animal)  which  did  not  satisfy  the  environmental  constraints  in 
effect  at  the  time.  In  fact,  given  the  complexity  of  the  natural  world  and 
the  extensive  pressures  brought  to  bear  by  Nature  on  an  organism,  most 
arbitrarily-designed  animals  would  perish,  because  the  chance  of  creating 
arbitrary  organisms  which  would  be  well-suited  to  the  environment  is  almost 
zero.  Unlike  the  world  of  the  imagination  or  children’s  toys,  the  natural  world 
cannot  contain  objects  of  arbitrary  configurations. 

As  such,  the  existing  species  are  special  in  an  important  way.  The  species 
represent  finely  tuned  structures,  Nature’s  solutions  to  the  constraint  sat¬ 
isfaction  problem  imposed  by  the  myriad  of  negative  environmental  con¬ 
straints.  “Survival  of  the  fittest”  may  be  interpreted  as  simply  the  statement 
that  the  surviving  species  satisfy  the  environmental  constraints  better  than 
any  other  species  competing  for  the  same  resources.  Because  of  the  extent 
of  these  constraints,  each  of  the  solutions  must  be  highly  constrained;  that 
is,  there  is  no  small  set  of  properties  of  an  organism  which  is  sufficient  for  its 
survival.  Stream-lined  contours,  fins,  eyes  on  opposite  sides  of  their  body  — 
these  attributes  combined  with  a  vast  set  of  internal  structures  permit  fish 
to  survive  in  the  aquatic  environment. 

Also, these solutions tend to be disparate [Mayr, 1984; Stebbins and
Ayala, 1985]. Because species of similar construction will be competing for the
same  resources,  variations  in  properties  important  to  the  organism’s  survival 
are  removed,  unless  the  variations  are  large  enough  such  that  the  organism 
is  now  in  a  different  niche.  The  pressure  of  natural  selection  moves  the  evo¬ 
lution  of  species  to  a  discrete  or  clustered  sampling  along  those  dimensions 
relevant to a species’ survival. We refer to this clustering as the “Principle of
Natural  Modes,”  and  because  it  is  central  to  our  development  of  a  natural 
categorization  we  restate  it  as  follows: 


Principle  of  Natural  Modes:  Environmental  pressures  force 
objects  to  have  non-arbitrary  configurations  and  properties  that 
define  object  categories  in  the  space  of  properties  important  to 
the  interaction  between  objects  and  the  environment. 

We  do  not  live  in  a  world  of  randomly  created  objects  and  visual  scenes,  but 
in  a  world  of  structure  and  form. 

To refine our claim about natural modes, let us make explicit the
claims  that  are  being  made,  as  well  as  those  that  are  not.  First,  the  existence 
of  natural  modes  implies  that  objects  do  not  exhibit  uniform  distributions 
of properties. Rather, objects display a great deal of redundancy,
redundancy created by the complex sets of constraints acting upon objects. For
example,  we  do  not  see  the  mythical  Griffin  (half  eagle,  half  lion).  Objects 
with  beaks  also  (tend  to)  have  feathers  and  wings  and  claws.  Redundancies 
such  as  these  make  it  “easy”  to  recognize  an  object  as  a  bird:  a  few  clues 
are  sufficient.  Second,  we  do  not  intend  to  restrict  the  claim  to  only  natural 
objects; in section 2.4.1 we will discuss constraints acting on man-made
objects as well. Finally, we are not claiming there exists a unique set of object
categories.  We  allow  for  the  possibility  that  the  clustering  of  objects  along 
dimensions  important  to  the  interaction  between  objects  and  the  environ¬ 
ment  may  be  “scale”  dependent:  clustering  occurs  at  different  levels  of  the 
object  hierarchy.  For  example,  consider  the  division  between  mammals  and 
birds,  and  then  the  separation  between  cows  and  mice.  The  clustering  which 
separates  mammals  from  birds  occurs  at  a  level  of  biological  processes  much 
“higher”  than  that  which  separates  cows  from  mice.  We  will  further  develop 
the  concept  of  levels  of  categorization  in  chapters  4  and  5  when  we  consider 
matching  the  goals  of  the  observer  to  the  structure  of  the  world.  For  now  we 
can assume that “natural mode categories” refer to some selection of categories
corresponding  to  a  natural  clustering  at  some  level. 

In  the  interest  of  completeness,  two  important  comments  need  to  be 
made.  The  first  is  that  we  are  not  stating  that  there  exist  objective  cate¬ 
gories  in  the  world,  independent  of  any  categorization  criteria.  Rather,  we 
are  stating  that  there  exists  a  clustering  along  dimensions  which  are  impor¬ 
tant  to  the  interaction  between  the  object  and  its  environment.  Therefore, 
if  some  sensory  apparatus  is  encoding  properties  related  to  these  important 
dimensions,  then  there  will  be  a  clustering  in  the  space  defined  by  that  sen¬ 
sory  mechanism.  The  reason  for  making  this  point  here  is  that  there  is  a 
large body of work by both philosophers and logicians arguing that there do
not  exist  objective  categories  in  the  world.  By  restricting  the  claim  to  con¬ 
sider  only  those  properties  important  to  the  interaction  between  the  object 
and  the  environment  we  can  finesse  the  problem  of  objective  categorization. 
In  section  2.3.1  we  will  provide  a  brief  review  of  the  arguments  against  the 
ontological  status  of  natural  categories  and  we  will  discuss  how  those  ideas 
relate  to  the  claim  of  natural  modes. 

The  second  point  is  that  the  Principle  of  Natural  Modes  is  similar  to 
Marr’s  “Fundamental  Hypothesis”  which  argued  that  if  a  collection  of  certain 
observable  properties  tended  to  be  grouped,  then  certain  other  properties 
(unobservable)  would  tend  to  group  similarly  [Marr,  1970].  The  principal 
difference  is  that  Marr  did  not  provide  a  motivation  for  why  one  would 
expect  to  find  certain  observable  properties  grouped  in  clusters.  In  fact,  the 
claim  of  natural  modes  by  itself  is  not  sufficient  to  provide  a  clustering  of 
objects  in  the  feature  space  of  observable  properties.  Therefore  we  extend 
our  claim  with  the  following  addition: 

Accessibility: The properties that are important to the interaction
of an object with its environment are (at least partially)
reflected  in  observable  properties  of  the  object. 

Fortunately,  this  claim  is  easily  justified.  For  example,  the  basic  shape  of 
an  object  usually  constrains  how  the  object  interacts  with  its  environment. 
The  legs  of  an  animal  permit  it  mobility.  The  color  of  an  object  is  often 
related  to  its  survival:  plants  are  green  and  polar  bears  are  white.  As  such, 
the  important  aspects  of  an  object  tend  to  be  reflected  in  properties  which 
are  observable.  Therefore,  the  Principle  of  Natural  Modes  taken  together 
with  the  claim  of  Accessibility  provides  a  basis  for  why  one  might  expect  to  find
a  clustered  distribution  of  objects  in  an  observer’s  feature  space. 

Finally  we  can  combine  the  goal  of  the  observer  —  to  construct  a  set  of 
categories  which  allow  the  observer  to  predict  important  unobserved  prop¬ 
erties  of  objects  —  with  the  claim  of  natural  modes.  We  make  the  following 
claim  about  the  appropriate  set  of  categories  for  recognition: 

Natural  Categorization:  If  an  observer  is  to  make  correct  in¬ 
ferences  about  objects’  unobserved  properties  from  the  observed 
properties,  then  he  should  categorize  objects  according  to  their 
natural  modes.

This  claim  follows  naturally  from  our  goal  of  recognition  and  the  proposed
Principle  of  Natural  Modes.  Given  that  the  observer  is  seeking  to  infer 
the  properties  which  describe  how  an  object  interacts  with  its  environment, 
and  given  that  these  properties  cluster  according  to  natural  modes,  then 
the  observer  should  attempt  to  categorize  objects  according  to  their  natural 
modes.  Accessibility  states  that  this  goal  can  be  accomplished  using  sensory 
data. 

Before  proceeding  to  the  next  sections,  let  us  return  to  the  skeptic’s 
question  of  why  one  should  use  categories  to  accomplish  the  proposed  goal 
of  recognition  —  the  inference  of  unobserved  properties  from  observed  prop¬ 
erties.  Now  that  we  have  presented  the  Principle  of  Natural  Modes  we  can 
argue  that  the  world  contains  categories  of  objects  which  support  generaliza¬ 
tion.  For  example,  suppose  one  believes  that  a  certain  set  of  objects  forms 
a  natural  category,  and  that  one  of  those  objects  exhibits  a  certain  (in  gen¬ 
eral)  unobserved  property,  e.g.  it  attacks  human  beings.  Then,  one  would 
make  the  prediction  that  all  objects  of  this  category  would  exhibit  the  same 
property.  If  one  were  using  standard  conditional  probabilities,  one  could 
not  make  this  assertion  without  some  particular  a  priori  probability  state¬ 
ment  about  how  to  generalize  over  objects  of  “similar”  observed  properties. 
But  such  a  statement  is  equivalent  to  believing  in  the  existence  of  natural 
categories.  Thus,  a  more  natural  (and  more  efficient)  method  of  using  this 
knowledge  is  to  explicitly  represent  the  categories  themselves. 

In  the  next  three  sections,  we  will  consider  arguments  against  and  ev¬ 
idence  for  the  existence  of  natural  modes.  The  primary  argument  against 
natural  modes  stems  from  the  work  of  philosophers  and  logicians  considering 
the  abstract  implications  of  natural  categories.  The  favorable  evidence,  how¬ 
ever,  is  derived  from  consideration  of  the  physical  world,  and  the  organisms 
that  inhabit  it. 


2.3  The  Philosophical  Issue  of  Natural 
Categories 

2.3.1  Questions  of  ontology 

Ontology  may  be  described  as  the  branch  of  philosophy  that  concerns  what 
exists  [Carey,  1986].  As  mentioned  in  section  2.2  there  has  been  considerable
attention  paid  to  the  question  of  whether  categories  can  really  be  said  to  exist
in  the  world,  rather  than  being  constructs  in  our  head.  In  this  section  we 
will  provide  a  brief  review  of  the  logical  argument  against  the  existence  of 
objective  natural  categories.  Then,  we  will  reconcile  this  argument  with  the 
principle  of  natural  modes. 

The  basic  issue  at  hand  is  whether  categories  exist  in  the  world  independent  of
some  observer.  Would  “rabbits”  be  a  more  natural  category  than  “round-or-blue-things”
if  there  were  no  organism  to  perceive  them?  Prima  facie,  the
principle  of  natural  modes  would  argue  for  the  existence  of  such  categories. 
However,  we  will  see  that  natural  categories  can  only  be  said  to  exist  if  we 
provide  constraint  external  to  objects  themselves;  an  outside  oracle  will  be 
required  to  restrict  what  aspects  of  an  object  may  be  considered  as  relevant 
to  categorization.  Only  then  is  it  reasonable  to  consider  one  categorization 
of  objects  as  more  natural  than  another. 

Perhaps  the  most  complete  discussion  of  the  subjective  nature  of  cate¬ 
gories  is  provided  in  Goodman  [1951].  There  it  is  demonstrated  that,  by 
the  appropriate  choice  of  logical  primitives  with  which  to  describe  objects, 
any  similarity  relationship  between  objects  can  be  constructed.  Thus,  if  a 
natural  set  of  categories  is  defined  by  some  measure  on  a  similarity  metric, 
then  any  categorization  may  be  selected.  Though  thorough,  Goodman’s  pre¬ 
sentation  is  quite  dense  and  difficult  to  recount.  As  such  we  will  provide  an 
alternative  form  of  the  argument  as  given  by  Watanabe  [1985].  This  formu¬ 
lation  —  referred  to  as  the  Ugly  Duckling  Theorem  —  makes  the  issues  of 
categorization  quite  clear. 

Let  us  state  the  theorem  directly  and  then  sketch  the  proof: 

Ugly  Duckling  Theorem:  Insofar  as  we  use  a  finite  set  of  pred¬ 
icates  that  are  capable  of  distinguishing  any  two  objects  consid¬ 
ered,  the  number  of  predicates  shared  by  any  two  such  objects  is 
constant,  independent  of  the  choice  of  two  objects.  [Watanabe,  1985,  p.  82]

We  will  provide  a  proof  of  this  remarkable  result  for  one  special  case;  through 
it  we  will  be  able  to  see  why  an  external  source  of  constraint  is  required  if 
we  are  to  consider  one  categorization  more  natural  than  any  other. 

To  prove  the  Ugly  Duckling  Theorem,  let  us  consider  a  world  of  objects 
that  are  described  by  only  2  binary  predicates,  A  and  B  (Figure  2.2).  In 
this  case  the  predicates  are  unconstrained  in  the  sense  that  $A$  and  $B$  carve
the  world  up  into  four  different  object  types,  $\alpha_1, \ldots, \alpha_4$,  corresponding  to  the
logical  descriptions  $\{(A \cap B), (A \cap \neg B), (\neg A \cap B), (\neg A \cap \neg B)\}$.  Now
let  us  consider  the  question  of  how  many  properties  are  shared  by  any  two
objects.

First,  one  must  realize  that  although  there  are  only  two  starting  pred¬ 
icates,  there  are  many  composite  predicates,  and  each  such  predicate  is  a 
property  in  its  own  right.  In  fact,  every  combination  of  the  atomic  regions
$\alpha_i$  is  an  allowable  predicate  or  property.  Let  us  define  the  “rank”  of  a  predicate
to  be  the  number  of  regions  or  object  types  ($\alpha_i$)  which  must  be  combined
to  form  that  predicate.  For  example,  the  predicates  of  rank  1  are  exactly
those  logical  combinations  given  above;  $\alpha_1$  defines  the  predicate  $(A \cap B)$,
which  is  said  to  be  “true”  for  the  object  $\alpha_1$  and  “false”  for  objects  $\alpha_2$,  $\alpha_3$,
and  $\alpha_4$.  An  example  of  a  predicate  of  rank  2  is  $(\neg A)$,  formed  by  the  union
$(\alpha_3 \cup \alpha_4)$.  An  interesting  predicate  of  rank  2  is  given  by  the  union  $(\alpha_2 \cup \alpha_3)$:
the  logical  equivalent  is  the  exclusive-OR  $(A \oplus B)$.  The  exclusive-OR  must
be  an  allowable  predicate:  if  $A$  corresponds  to  “blind  in  the  left  eye”  and  $B$
corresponds  to  “blind  in  the  right  eye,”  then  $(A \oplus B)$  is  the  predicate  “blind
in  one  eye,”  a  perfectly  plausible  property.  Since  all  possible  combinations
of  regions  are  permitted  to  form  predicates  (if  one  allows  the  null  predicate
which  is  false  for  all  objects,  and  the  identity  predicate  which  is  true  for  all
objects)  there  are  $2^4 = 16$  possible  predicates  defined  in  our  simple  world  of
two  starting  predicates.
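
The count of 16 can be checked by brute force. The following is a minimal Python sketch, not part of the original analysis: the names ATOMS, all_predicates and by_rank are illustrative assumptions, and the four object types are written a1 through a4. Each predicate is represented by the set of object types for which it is true, so the rank of a predicate is simply the size of its set.

```python
from itertools import combinations

# Object types for two binary predicates A and B, encoded as
# (value of A, value of B): a1 = A and B, a2 = A and not B,
# a3 = not A and B, a4 = not A and not B.
ATOMS = {"a1": (True, True), "a2": (True, False),
         "a3": (False, True), "a4": (False, False)}


def all_predicates(atoms):
    """Return every predicate as a frozenset of the atoms it is true for."""
    names = sorted(atoms)
    return [frozenset(c)
            for rank in range(len(names) + 1)
            for c in combinations(names, rank)]


predicates = all_predicates(ATOMS)

# Group the predicates by rank (number of atoms combined).
by_rank = {}
for p in predicates:
    by_rank.setdefault(len(p), []).append(p)

# The exclusive-OR is true exactly where A and B differ.
xor = frozenset(n for n, (a, b) in ATOMS.items() if a != b)

print(len(predicates))                             # total number of predicates
print({r: len(ps) for r, ps in by_rank.items()})   # predicates per rank
print(sorted(xor))                                 # atoms making up A xor B
```

Rank 0 holds the null predicate and rank 4 the identity predicate; the exclusive-OR appears among the rank 2 predicates as the set containing a2 and a3.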

We  can  arrange  these  predicates  in  a  “truth”  lattice  as  shown  in
Figure  2.3.  The  lattice  is  layered  by  rank  and  connected  such  that  a  straight
line  indicates  implication  from  the  lower  rank  to  the  higher  one.  For  example,
$(A \cap B)$  implies  $A$,  which  in  turn  implies  $(A \cup B)$.  Notice  that  the
rank  1  predicates  correspond  to  each  of  the  different  possible  objects.  The
properties  which  are  true  for  an  object  may  be  found  by  following  all  up¬ 
ward  connections  from  that  object’s  node;  similarly,  any  node  in  the  lattice 
accessible  from  two  different  objects  represents  a  property  shared  by  those 
objects. 
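
The lattice relations can be sketched in a few lines of Python. This is an illustrative sketch under our own encoding (predicates as sets of object types a1 through a4; the helper names implies and properties_of are assumptions): implication between predicates becomes set inclusion, and the properties of an object are all predicates whose set contains that object.

```python
from itertools import combinations

# a1 = A and B, a2 = A and not B, a3 = not A and B, a4 = not A and not B
ATOMS = ("a1", "a2", "a3", "a4")

A = frozenset({"a1", "a2"})
B = frozenset({"a1", "a3"})
A_AND_B = A & B          # rank 1
A_OR_B = A | B           # rank 3


def implies(p, q):
    """p implies q when every object type satisfying p also satisfies q."""
    return p <= q


def properties_of(atom):
    """All predicates (sets of atoms) that are true for the given object type."""
    others = [a for a in ATOMS if a != atom]
    return [frozenset((atom,) + c)
            for k in range(len(others) + 1)
            for c in combinations(others, k)]


print(implies(A_AND_B, A), implies(A, A_OR_B))   # an upward chain in the lattice
print(implies(A, B))                             # A and B are not connected
print(len(properties_of("a1")))                  # nodes reachable upward from a1
```

Following all upward connections from one object reaches half of the lattice: every predicate that includes that object's region.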

Now,  the  important  question  is  how  many  properties  are  shared  by  any 
two  objects.  Given  the  symmetry  of  the  lattice  it  should  not  be  surprising
that  each  of  the  objects  shares  exactly  4  properties  with  each  of  the  other
objects.1  If  we  consider  the  complete  set  of  possible  properties,  then  any  two
objects  have  exactly  the  same  number  of  properties  in  common.  Thus  any
similarity  metric  based  upon  the  number  of  common  properties  would
assign  an  equal  similarity  to  all  object  pairs.  Given  this  state  of  affairs,  it 
would  not  be  plausible  to  consider  any  one  categorization  of  objects,  any  one 
grouping  of  instances  according  to  some  similar  properties,  as  more  natural 
than  some  other. 
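
The constancy of the shared-property count can be verified by exhaustive enumeration. The sketch below is an illustration under our own encoding, not Watanabe's proof: object types are numbered 0 to m-1, a predicate is any subset of them stored as a bit mask, and the function name shared_predicate_counts is an assumption. A pair of objects shares a predicate when both of their bits are set, which leaves the remaining m-2 bits free and so gives 2^(m-2) shared predicates for every pair.

```python
from itertools import combinations


def shared_predicate_counts(m):
    """Map each pair of object types to the number of predicates true for both.

    Object types are numbered 0..m-1 and a predicate is any subset of them,
    encoded as an m-bit integer mask.
    """
    counts = {}
    for i, j in combinations(range(m), 2):
        pair = (1 << i) | (1 << j)
        counts[(i, j)] = sum(1 for mask in range(1 << m)
                             if mask & pair == pair)
    return counts


for m in (4, 5, 6):
    counts = shared_predicate_counts(m)
    # The set of distinct counts has a single element, equal to 2**(m - 2).
    print(m, set(counts.values()), 2 ** (m - 2))
```

For m = 4 the single value is 4, which counts the identity predicate among the shared properties.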

Yet,  most  observers  would  agree  that  a  dog  and  a  cat  are  more  “similar”
than  are  a  dog  and  a  television.  How  can  we  resolve  this  intuition  against 
the  theorem  of  the  Ugly  Duckling  (so  named  since  it  states  that  the  Ugly 
Duckling  is  as  similar  to  the  swan  as  is  any  other  duck)?  The  answer  must  lie 
in  somehow  restricting  the  set  of  properties  which  can  be  considered.  In  our 
simple  world  of  two  base  predicates  there  were  14  non-trivial  properties  which 
were  considered.  Under  this  description  all  objects  were  equally  similar.  If, 
however,  we  remove  certain  properties  from  consideration,  then  it  will  be  the 
case  that  some  pairs  of  objects  will  share  more  properties  than  others,  and  a 
similarity  metric  based  upon  shared  properties  will  yield  distinct  categories.
How,  then,  can  we  decide  which  properties  to  remove  from  consideration? 
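
To see the effect of removing properties, consider the following sketch. It assumes, purely for illustration, that only A, B and their negations are retained as properties; the dictionary RESTRICTED and the helper shared are hypothetical names, and the object types are written a1 through a4.

```python
from itertools import combinations

# Object types: a1 = A and B, a2 = A and not B,
#               a3 = not A and B, a4 = not A and not B.
# Hypothetical restricted vocabulary: only A, B and their negations count
# as properties; composite predicates such as exclusive-OR are excluded.
RESTRICTED = {
    "A": {"a1", "a2"},
    "not A": {"a3", "a4"},
    "B": {"a1", "a3"},
    "not B": {"a2", "a4"},
}


def shared(x, y, predicates):
    """Names of the predicates that are true for both object types."""
    return [name for name, ext in predicates.items() if x in ext and y in ext]


for x, y in combinations(["a1", "a2", "a3", "a4"], 2):
    print(x, y, shared(x, y, RESTRICTED))
```

Under this restricted vocabulary a1 and a2 share the property A, while a1 and a4 share nothing, so a metric based on shared properties no longer treats all pairs alike. The particular choice of four properties here is arbitrary, which is precisely the difficulty raised by the question above.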

Unfortunately,  it  is  impossible  to  decide  which  properties  to  discard  simply
on  syntactic  grounds,  that  is,  without  consideration  of  their  meaning.
Both  Goodman  [1951]  and  Watanabe  [1985]  provide  persuasive  arguments
that  no  property  can  be  regarded  as  a  priori  more  primitive  or  more  ba¬ 
sic  than  any  other;  a  redefinition  of  terms  which  preserves  logical  structure 
but  changes  the  basic  vocabulary  can  always  cause  syntactically  complicated 
properties  to  become  simple,  and  simple  ones  to  become  complex.2  Also,  as 
with  the  example  of  “blind  in  one  eye,”  unusual  or  disjunctive  concepts  may 
be  just  as  sensible  as  those  defined  more  simply  in  a  given  vocabulary.  Thus, 
if  we  are  to  weight  some  properties  more  than  others,  we  must  have  an  external
motivation  for  doing  so.  This  source  of  information  is  referred  to  as
“extra-logical”  by  Goodman.

1  Watanabe  [1985]  extends  the  discussion  to  include  any  number  of  predicates.  In  general,
if  there  are  $m$  atoms,  where  an  atom  is  defined  by  an  indivisible  region  $\alpha_i$,  then  there  are
$2^m$  predicates  and  any  two  objects  share  $2^{m-2}$  of  them.  This  result  is  valid  even  if  the
starting  predicates  are  constrained,  e.g.  the  predicate  $B$  includes  $A$  such  that  $A$  implies  $B$.
The  only  critical  assumption  is  that  the  vocabulary  used  to  describe  the  objects  partitions
the  world  into  a  finite  number  of  distinct  classes.

2  Though  see  Osherson  [1978]  on  some  syntactic  conditions  which  should  be  met  by  “natural”
properties.  This  claim,  however,  is  controversial  (see  [Keil,  1981;  Carey,  1986]).

Let  us  once  again  consider  the  principle  of  natural  modes.  We  state  that 
objects  will  tend  to  cluster  along  dimensions  important  to  the  interaction 
between  objects  (organisms)  and  the  environment.  That  is,  we  claim  that 
if  we  restrict  the  properties  under  consideration  to  only  those  involved  in  an
object’s  interaction  with  the  environment,  then  there  will  be  a  clustering
of  objects  which  will  define  natural  categories.  Thus  our  external  source  of
information,  our  oracle  which  decides  what  properties  should  be  considered,
consists  of  the  laws  of  the  physical  and  biological  world.  The  physical  constraints
imposed  upon  objects  and  organisms  select  the  properties  of  objects  in  which 
natural  categories  are  defined. 

2.3.2  Induction  and  natural  kinds 

A  related  problem  of  philosophy  is  the  issue  of  natural  kinds.  As  an  illus¬ 
tration,  consider  an  example  similar  to  that  described  by  Quine  [1969]:  An 
explorer  arrives  on  an  uncharted  island,  and  meets  natives  never  before  vis¬ 
ited  by  “civilized”  men.  Being  an  amateur  linguist  the  explorer  attempts  to 
compile  a  dictionary  of  the  vocabulary  of  the  natives.  One  day,  while  ac¬ 
companying  the  explorer  on  a  trip  through  the  forest,  a  native  points  to  an 
area  where  a  rabbit  is  sleeping  beneath  a  tree  and  utters  the  word  “blugle.” 
The  explorer  writes  in  his  dictionary  that  “blugle”  means  “rabbit.”  Quine 
asks  how  does  the  explorer  know  that  the  native  is  referring  to  the  rabbit 
and  not  the  situation  rabbit-under-a-tree.  Even  if  the  explorer  could  test 
this  distinction  (say  by  pointing  to  another  rabbit,  perhaps  cooked,  and  an¬ 
nouncing  “blugle”  and  awaiting  the  response)  he  could  never  test  all  possible 
meanings  consistent  with  the  situation. 

Yet,  we  believe  the  explorer  is  probably  correct  in  his  conclusion,  and 
even  if  he  is  not  correct  on  his  first  attempt,  we  believe  that  he  will  probably 
be  correct  on  his  second  or  third  (perhaps  “blugle”  means  “sleeping”  or 
“cute,”  but  surely  it  does  not  mean  “small-furry-leg-shaped-piece-within-ten-meters-of-that-particular-tree”).
After  considering  how  it  is  that  the
explorer  is  likely  to  be  correct,  and  related  problems  such  as  why  people
tend  to  agree  on  the  relative  similarities  between  objects,  Quine  concludes 
that  people  must  be  endowed  with  an  innate  “spacing  of  qualities”  [1969,  p. 
123].  Such  a  spacing  would  provide  people  with  a  standard  of  similarity  that 
permitted  convergence  of  their  descriptions  of  the  world.  An  innate  quality 
space  is  an  example  of  extra-logical  constraint  being  provided  to  the  observer
for  the  formation  of  object  categories. 

2.4  Natural  Object  Processes 

In  this  section  we  provide  a  brief  discussion  of  the  physical  basis  underlying 
the  natural  modes.  In  Bobick  and  Richards  [1986]  the  construct  of  an  object 
process  is  proposed  as  a  model  of  the  processes  responsible  for  the  creation
of  natural  modes.  An  object  process  represents  the  interaction  between  some 
generating  process  (which  actually  produces  objects)  and  the  constraints  of 
the  environment.  For  the  discussion  here  we  consider  some  of  the  physical 
evidence  for  natural  object  processes  responsible  for  the  natural  modes  and 
relate  those  processes  to  the  claim  of  Accessibility. 

2.4.1  Physical  basis  for  natural  modes 

We  have  made  a  claim  about  the  structure  of  the  natural  world:  objects 
cluster  in  dimensions  significant  to  their  interaction  with  the  environment. 
If  this  is  the  case,  then  there  must  be  underlying  physical  processes  which 
give  rise  to  these  clustered  distributions,  and  produce  these  natural  modes 
of  objects.  Therefore  we  should  be  able  to  find  evidence  in  the  world  of  such 
processes. 

Fortunately,  such  evidence  is  quite  abundant.  In  the  world  of  biologi¬ 
cal  objects,  the  fact  that  structures  must  evolve  from  previous  structures 
places  a  strong  constraint  on  the  forms  present  [Dumont  and  Robertson, 
1986;  Thompson,  1962].  An  interesting  observation  supporting  this  claim 
is  provided  by  Stebbins  and  Ayala  [1985]  who  noted  the  non-uniformity  in
the  distribution  of  the  complexity  of  DNA.  As  Pentland  [1986]  has  noted, 
“evolution  repeats  its  solutions  whenever  possible,”  reducing  the  number  of 
occurring  natural  forms;  this  conclusion  was  also  reached  by  Walls  [1963] 
in  his  discussion  about  the  repeated  evolution  of  color  vision.  Additional 
support  for  the  principle  of  natural  modes  comes  from  the  field  of  evolutionary
biology.  Mayr  [1984]  states: 

[The  biological  species]  concept  stresses  the  fact  that  species  consist
of  populations  and  that  species  have  reality  and  an  internal  genetic
cohesion  owing  to  the  historically  evolved  genetic  program  that  is
shared  by  all  members  of  the  species.

The  objective  existence  of  species  represents  a  structuring  of  the  world  in¬ 
dependent  of  any  particular  observer. 

Structure  in  the  physical  world  can  also  be  discovered  by  examining  the 
physical  processes  responsible  for  the  existence  of  many  forms.  Stevens’  analysis
of  patterns  [1974]  is  an  example  of  constraint  imposed  by  the  physics
of  matter  in  the  formation  of  structure;  the  fact  that  “interesting”  patterns 
emerge  is  an  example  of  natural  modes.  (See  also  Thompson  [1962].)  The 
work  by  vision  researchers  to  model  different  physical  processes  so  as  to  con¬ 
struct  representations  for  different  types  of  objects  is  plausible  only  because 
there  are  limited  ways  for  nature  to  create  objects  [Pentland,  1986;  Kass  and
Witkin,  1985;  Pentland,  1984].  Even  chaotic  systems  have  modes  of  behavior 
[Levi,  1986]. 

It  should  be  noted  that  man-made  objects  are  also  subject  to  constraints 
upon  form,  although  the  environmental  pressures  are  different.  For  example, 
a  chair  must  have  certain  geometric  properties  to  be  able  to  function  appro¬ 
priately.  It  must  allow  access  and  stability,  placing  significant  constraints  on 
its  shape.  A  table  must  have  a  flat  nearly  horizontal  surface  with  a  stable 
support  to  function  as  a  table.  An  even  more  complicated  set  of  constraints 
related  to  ease  of  manufacturing  and  people’s  aesthetic  interests  operates
on  most  constructed  objects.  Why  is  it  that  most  books  have  similar  as¬ 
pect  ratios?  The  common  visual  scene  of  “row  houses”  is  an  example  of 
structure  imposed  by  man  mimicking  the  type  of  natural  modes  produced 
by  nature.  For  a  more  extensive  discussion  about  constraints  on  the  shapes 
of  objects  and  the  non-arbitrary  nature  of  objects  see  [Winston,  et  al.,  1983;
Lozano-Perez,  1985;  Thompson,  1961].


2.4.2  Observed  vs.  unobserved  properties 

It  is  important  to  relate  the  existence  of  natural  object  processes  to  the  claim 
of  Accessibility.  The  claim  of  Accessibility  states  that  some  of  the  proper¬ 
ties  important  to  an  object’s  interaction  with  the  environment  are  reflected 
in  observed  properties;  the  importance  of  this  claim  is  that  it  permits  us 
to  attempt  to  recover  the  natural  categories  from  the  observed  properties. 
In  light  of  the  discussion  about  natural  object  processes,  we  can  view  Ac¬ 
cessibility  as  independence  between  the  sensory  processes  and  the  processes 
responsible  for  the  structure  of  an  object.  Because  the  distinction  between
observed  and  unobserved  properties  occurs  only  because  of  the  sensory  ap¬ 
paratus,  we  can  look  for  natural  modes  in  only  the  observed  properties  and 
assume  that  the  modal  behavior  of  the  unobserved  properties  will  follow. 
Because  most  of  the  data  provided  to  the  observer  are  observed  properties,
this  dissociation  between  observed  and  unobserved  properties  is  essential  for 
recovering  natural  categories. 

2.5  Psychological  Evidence  for  Natural 
Categories 

Until  now,  our  arguments  for  the  existence  of  natural  modes  have  rested 
on  evidence  from  the  world  itself.  In  particular  we  have  claimed  that  the 
physics  of  our  world,  including  the  evolutionary  pressures  of  the  environment, 
cause  objects  to  have  non-arbitrary  configurations.  However,  if  it  is  the  case 
that  it  makes  sense  to  describe  our  world  as  having  natural  categories,  and, 
as  we  have  claimed,  that  describing  the  world  in  terms  of  these  categories 
permits  one  to  make  useful  inferences  about  objects,  then  we  might  expect 
these  categories  to  be  manifest  in  the  psychology  of  organisms  that  make 
such  inferences.  That  is,  we  should  be  able  to  detect  the  presence  of  natural 
categories  in  the  mental  organization  of  the  world  used  by  different  perceiving 
organisms.  Notice  that  the  existence  of  mental  categories  does  not  imply  the 
existence  of  categories  in  the  world,  only  that  the  world  is  structured  in  such 
a  way  as  to  permit  the  formation  of  visual  categories  which  are  useful  to 
the  observer.  Therefore  the  ability  to  create  such  a  categorization  is  a  necessary
condition  for  the  expression  of  natural  modes  in  observable  properties. 

In  fact,  a  wealth  of  literature  exists  attesting  to  the  psychological  real¬ 
ity  of  natural  categories.  Evidence  may  be  found  in  both  cognitive  science 
and  animal  psychology.  In  particular  the  interaction  between  natural  cat¬ 
egories  and  perceptual  recognition  tasks  has  been  extensively  investigated. 
We  present  a  brief  review  of  the  relevant  literature,  especially  as  it  relates  to
object  perception. 

2.5.1  Basic  level  categories 

In  1976,  Eleanor  Rosch  and  her  colleagues  published  what  has  become  a  clas¬ 
sic  paper  in  the  field  of  cognitive  psychology  [Rosch,  Mervis,  Gray,  Johnson, 
and  Boyes-Braem,  1976].

The  principal  finding  of  that  work  was  that  people  tend  to  categorize  objects
in  the  world  at  one  particular  abstract  taxonomic  level.  This  level  is
operationally  defined  as  the  level  at  which  categories  have  many  well  defined
attributes  but  at  which  there  is  little  overlap  of  the  attributes  with  those  of
other  categories.  As  an  example  consider  the  simple  taxonomic  relation  of
“fruit  →  apple  →  McIntosh-apple”  where  x  →  y  means  x  includes  y.  In  this
case,  Rosch  et  al.  demonstrated  that  the  preferred  level  of  description  is  “apple.”
The  reason  for  this  was  given  to  be  that  few  attributes  can  be  assigned
to  “fruit”  relative  to  the  number  of  attributes  assignable  to  “apple,”  while  the
lower  level  category  “McIntosh-apple”  is  a  category  whose  attributes  overlap
extensively  with  other  lower  level  categories  such  as  “Delicious-apple.”  The
basic  level,  in  this  case  “apple”,  is  that  taxonomic  level  at  which  category
members  have  a  well  defined  structure  (in  Rosch’s  concrete  noun  examples
we  explicitly  mean  physical  structure)  and  at  which  there  are  no  other  categories
that  significantly  share  that  structure.  Perhaps  the  most  important
aspect  of  the  work  by  Rosch,  et  al.  was  the  demonstration  that  categories  at
the  basic  level  appear  to  be  more  accessible  for  a  variety  of  cognitive  tasks
(presently  we  will  consider  the  interaction  between  basic  level  categories  and
the  perceptual  task  of  object  recognition),  indicating  that  these  categories
enjoy  some  special  psychological  status.  That  is,  there  is  strong  evidence
that  these  categories  have  some  degree  of  psychological  reality.
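
The operational definition of the basic level (many attributes, little overlap with neighboring categories) can be made concrete with a toy calculation. The sketch below is only an illustration: the attribute lists are invented for this example and are not the data of Rosch et al., and the use of Jaccard overlap between attribute sets is our assumption about one reasonable way to quantify "overlap."

```python
from itertools import combinations

# Hypothetical attribute lists, invented purely for illustration.
APPLE = {"edible", "grows on plants", "round", "has a stem", "has a core", "crisp"}
BANANA = {"edible", "grows on plants", "elongated", "yellow", "peels", "soft"}

LEVELS = {
    "superordinate": {
        "fruit": {"edible", "grows on plants"},
        "furniture": {"man-made", "found indoors"},
    },
    "basic": {"apple": APPLE, "banana": BANANA},
    "subordinate": {
        "McIntosh-apple": APPLE | {"red-green skin"},
        "Delicious-apple": APPLE | {"deep red skin"},
    },
}


def mean_attribute_count(categories):
    """Average number of attributes per category at one taxonomic level."""
    return sum(len(attrs) for attrs in categories.values()) / len(categories)


def mean_overlap(categories):
    """Mean Jaccard overlap between the attribute sets of category pairs."""
    pairs = list(combinations(categories.values(), 2))
    return sum(len(x & y) / len(x | y) for x, y in pairs) / len(pairs)


for level, cats in LEVELS.items():
    print(level, mean_attribute_count(cats), round(mean_overlap(cats), 2))
```

In this toy taxonomy the superordinate level has few attributes, the subordinate level has many attributes but high overlap between sibling categories, and only the middle level combines a rich attribute set with low overlap.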

Several  attempts  have  been  made  to  formally  define  basic  level  categories 
in  terms  of  attributes  and  categories;  this  thesis  implicitly  contains  one  such 
attempt.  Let  us  postpone  the  discussion  of  these  theories  until  chapter  3 
where  a  review  of  the  various  disciplines  which  have  addressed  the  catego¬ 
rization  problem  —  these  include  cognitive  psychology,  pattern  recognition 
and  machine  learning  —  is  presented.  For  now,  the  important  point  is  that 
there  exists  empirical  evidence  of  a  particular  set  of  categories  being  used  to 
describe  objects  in  the  world. 

One  of  the  cognitive  operations  in  which  basic  level  categories  show  a 
marked  superiority  is  that  of  object  recognition,  whether  the  actual  task  be  a 
speed  of  naming  task  [Rosch,  et  al.,  1976;  Murphy  and  Smith,  1982;  Jolicoeur,
Gluck,  and  Kosslyn,  1984]  or  a  confirmation  task  where  the  subject  is  primed
with  the  name  of  a  category  and  has  to  decide  whether  a  picture  of  an  object 
belongs  to  that  category  (see  the  analysis  of  Potter  and  Faulconer  [1975] 
given  in  Jolicoeur,  et  al.  [1984]).  These  findings  are  of  particular  interest  here 
because  the  principal  problem  addressed  by  this  thesis  is  that  of  categorizing
objects  into  classes  suitable  for  recognition.  Specifically,  we  would  like
to  know  whether  basic  level  categories  are  special  in  a  perceptual  sense  as 
opposed  to  simply  being  more  easily  accessed  as  concepts  by  some  cognitive 
process. 

To  address  this  question,  Murphy  and  Smith  [1982]  designed  an  artifi¬ 
cial  world  in  which  to  test  the  perceptual  superiority  of  basic  level  cate¬ 
gories.  By  using  artificially  created  superordinate,  basic,  and  subordinate 
categories,  they  were  able  to  control  factors  such  as  word  frequency,  order 
of  learning,  and  length  of  linguistic  description  (real  basic  level  categories 
tend  to  have  simple  one  word  labels).  These  factors  were  considered  to  be 
possible  confounding  factors  in  the  results  originally  reported  by  Rosch,  et 
al.  [1976].  Murphy  and  Smith  did  indeed  replicate  the  finding  that  objects 
can  be  categorized  fastest  at  the  basic  level.  They  attributed  this  superiority 
to  the  fact  that  basic  level  categories  have  more  perceptual  structure  than 
superordinate  categories,  while  at  the  same  time  having  many  discriminat¬ 
ing  attributes  from  other  basic  level  categories.  Because  these  were  artificial 
objects,  Murphy  and  Smith  were  able  to  claim  that  the  advantage  demon¬ 
strated  by  the  basic  level  categories  in  the  task  of  recognition  was  caused  by 
a  purely  perceptual  mechanism. 

Jolicoeur,  et  al.  [1984]  extended  the  work  of  Murphy  and  Smith.  Murphy
and  Smith  [1982]  postulated  that  categorizing  objects  as  belonging  to
superordinate  categories  was  difficult  (slower)  because  of  the  disjointedness 
of  the  perceptual  structure.  For  example,  to  test  if  an  object  is  a  fruit  would 
require  matching  the  incoming  stimulus  to  a  highly  disjunctive  perceptual 
model  (something  that  would  match  either  a  banana  or  an  apple).  Jolicoeur, 
et  al.  make  the  stronger  claim  that  superordinate  and  subordinate  categorizations
are  slower  because  object  recognition  first  takes  place  at  the  basic
level,  and  then  further  processing  is  required  to  determine  the  superordinate
or  subordinate  category.  For  example,  if  the  task  requires  determining
whether  an  object  is  a  fruit,  then  when  presented  with  an  image  of  an  ap¬ 
ple,  the  subject  would  first  recognize  the  object  as  an  “apple,”  and  then  use 
semantic  information  to  conclude  that  it  is  indeed  a  “fruit.”  Similarly,  if 
attempting  to  categorize  at  the  subordinate  level,  the  subject  would  again 
first  determine  the  basic  category  and  then  compute  the  necessary  addi¬ 
tional  perceptual  information  required  to  determine  the  subordinate  level, 
e.g.  “McIntosh.”

To  test  this  hypothesis,  Jolicoeur,  et  al.  considered  the  correlation  between
latencies  in  both  perceptual  and  non-perceptual  tasks.  In  one  experiment
they  discovered  that  the  time  to  name  the  superordinate  category  of
an  object  when  presented  with  its  image  correlated  well  with  the  time  to
name  the  superordinate  category  given  the  word  describing  an  object’s  basic
category.  For  example,  the  latencies  measured  when  subjects  were  given  the 
word  “apple”  and  required  to  announce  “fruit”  behaved  similarly  to  those 
latencies  recorded  when  subjects  were  presented  with  a  picture  of  an  apple. 
One  possible  interpretation  of  this  result  is  that  some  words  are  inherently
easier  to  access  than  others.  To  rule  out  this  possibility,  correlations
were  checked  for  items  within  the  same  superordinate  category;  both  “apple” 
and  “banana”  require  the  response  “fruit.”  For  each  such  item  the  correct 
superordinate  response  is  identical,  allowing  us  to  remove  the  effect  of  the 
degree  of  difficulty  in  making  the  response.  Here  too  the  latency  of  the  per¬ 
ceptual  task  correlated  well  with  the  latency  of  the  linguistic  task.  Thus  the 
superordinate  categorization  data  support  the  claim  that  perceptual  access 
does  indeed  occur  at  the  basic  level. 

Jolicoeur,  et  al.  [1984]  performed  a  second  experiment  to  test  the  claim 
that  objects  were  accessed  at  the  basic  level.  Recall  that  under  this  hy¬ 
pothesis  additional  perceptual  processing  beyond  basic  level  is  required  only 
for subordinate categorization. Superordinate identification requires only
semantic information (e.g. knowledge that an apple is a fruit). Thus one
would expect a differential effect between the latencies (and error rates) of
identification for subordinate and superordinate categories as one varied
the duration of exposure to the perceptual stimulus. In fact, such a differential
effect was found: reducing exposure times from 1 sec. to 75 msec.
produced no effect on the latencies to name superordinate categories but
produced a large increase in the time required to name the subordinate
category. Thus, the subordinate categorization data also support the claim that
object  recognition  first  occurs  at  the  basic  level. 

In  summary,  cognitive  psychology  provides  evidence  that  people  make 
use  of  a  particular  categorization  of  the  world  in  a  variety  of  cognitive  tasks. 
These  basic  level  categories  occurred  at  the  taxonomic  level  at  which  objects 
possessed  a  high  degree  of  structure  while  minimizing  category  overlap;  this 
condition  is  equivalent  to  stating  that  knowledge  of  an  object’s  basic  level 
category would permit many inferences about the object’s properties, while
identifying an object’s category would be reliable given the minimal overlap
with other basic level categories. While the existence of these categories does
not necessarily (in the logical sense) imply the existence of natural categories
in the world, it does support the view that the world is structured in such
a way as to make a categorical description useful for a variety of tasks. The
work  demonstrating  that  object  recognition  first  takes  place  at  the  basic 
level  supports  our  claim  in  section  2.2  that  categories  which  would  be  useful 
for  making  reliable  inferences  about  objects  are  the  appropriate  categories 
for  recognition. 

2.5.2  Animal  psychology 

If  the  structure  of  the  world  is  such  that  there  exists  a  categorization  which 
is  natural  for  recognition  (would  permit  reliable  inferences  about  objects) 
then  it  should  be  the  case  that  other  organisms  in  the  same  world  would 
also  exhibit  such  a  categorization  in  their  psychologies.  Therefore  let  us 
consider  the  work  performed  with  animals  in  trying  to  establish  which  set  of 
categories  they  possess.  Unfortunately  one  is  limited  in  the  types  of  tasks  one 
can  require  an  animal  to  do,  and  most  conclusions  about  animals’  categories 
are  based  on  how  well  and  how  quickly  they  learn  to  discriminate  various 
sets  of  stimuli.  Nevertheless,  interesting  results  about  the  categorization  of 
objects  used  by  animals  have  been  reported.  Hernstein  [1982]  provides  an 
excellent  review  of  the  studies  of  animals’  categories. 

Cerella [1979] studied the ability of pigeons to learn to discriminate white-oak
leaves from other types of leaves. After learning to perfectly discriminate
40  white-oak  leaves  from  other  leaves,  the  pigeons  were  able  to  generalize 
to  40  new  instances  of  white-oak  leaves.  Such  results  suggest  that  the  pi¬ 
geons  acquired  a  “category”  corresponding  to  white-oak  leaves.  Cerella  then 
trained  pigeons  using  40  non-oak  leaves  and  one  white-oak  leaf,  repeated 
many times; he then tested these pigeons with probes including 40 different
white-oak leaves. Even after having seen only one white-oak leaf, the
pigeons  were  able  to  successfully  discriminate  between  white-oak  leaves  and 
other  leaves.  This  remarkable  finding  suggests  that  not  only  do  the  pigeons 
form  a  category  corresponding  to  the  white-oak  leaves,  they  also  extract 
the  attributes  necessary  to  distinguish  the  “natural”  category  white-oak  leaf 
from  other  leaves.  This  type  of  learning  provides  powerful  evidence  that 
the world is clustered in recoverable natural modes: an organism’s perceptual
processes are tuned to be sensitive to the attributes of objects that are
constrained by the processes responsible for the object’s formation. As in
the  experiments  reported  by  Gould  and  Marler  [1987]  concerning  the  role  of 
instinct  in  animal  learning,  these  results  underscore  the  importance  of  pro¬ 
viding  the  organism  with  the  necessary  underlying  theory  of  structure  if  the 
organism  is  to  successfully  interact  with  its  environment. 

Hernstein  and  de  Villiers  [1980]  tested  the  ability  of  pigeons  to  learn  the 
“natural”  category  of  fish.  One  of  the  reasons  they  chose  fish  is  that  fish  are 
not  part  of  the  natural  habitat  of  pigeons  and  thus  their  prior  experience 
could not influence the results. Their training stimulus set consisted of 80
underwater photographs, 40 of which contained fish (in various orientations
and occlusion relations) and 40 of which did not; the negative examples did
contain images of other creatures such as turtles and human divers. Pigeons
rapidly learned to discriminate between the two sets of images, reaching
a rate of discrimination comparable to that of experiments using objects
normally found in the pigeons’ habitat such as trees and people [Hernstein,
Loveland, and Cable, 1976]. When tested on novel pictures, all the pigeons
generalized in at least some of the tests. Another set of pigeons was trained
using the same stimuli, but in this case the pictures were divided randomly.
The  pigeons  were  unable  to  achieve  a  discrimination  ability  comparable  to 
the  fish  versus  non-fish  group  and  any  ability  they  did  acquire  took  longer  to 
achieve.  Thus  we  may  take  these  findings  to  suggest  that  pigeons  developed 
the  “natural”  category  of  “fish.”  The  interesting  aspect  of  this  result  is  that 
fish  are  not  part  of  the  environment  normally  experienced  by  pigeons  nor 
have  they  been  so  for  50  million  years.  Therefore  it  is  unlikely  that  the 
genetic  experience  of  the  species  would  encode  the  category  “fish.”  Thus 
we  can  assume  that  there  is  something  about  the  general  perceptual  process 
of  the  pigeons  which  makes  “fish”  a  natural  category.  This  is  analogous 
to  Quine’s  [1969]  innate  quality  space  mentioned  in  section  2.3.  The  fact 
that  the  innate  quality  space  of  pigeons  —  an  organism  unfamiliar  with  the 
aquatic  environment  —  would  lead  to  the  formation  of  a  category  “fish”  is 
additional  evidence  that  natural  modes  exist  in  the  world  and  that  they  are 
perceptually  recoverable. 


Chapter  3 
Previous  Work 


Although  the  problem  of  categorization  addressed  in  this  thesis  is  one  of 
psychology  —  how  do  people  organize  their  representation  of  objects  with 
respect  to  recognition  —  the  general  problem  of  discovering  “natural”  or 
important  classes  in  a  collection  of  instances  can  be  found  in  many  branches 
of  science.  Particularly  relevant  here  are  the  following  three  disciplines:  1) 
cognitive  science,  in  which  several  attempts  have  been  made  to  formalize 
the  concept  of  basic  level  categories;  2)  cluster  analysis,  the  study  of  the 
automated  partitioning  of  numerical  data  into  meaningful  classes;  and  3) 
machine  learning,  a  subfield  of  artificial  intelligence  which  considers  the  is¬ 
sues  involved  in  producing  a  machine  which  can  learn  about  structure  in  its 
environment.  The  scope  of  this  chapter  precludes  giving  a  thorough  descrip¬ 
tion  of  all  the  relevant  work  contained  in  these  disciplines;  several  complete 
books  have  been  dedicated  to  each.  As  such,  we  will  present  a  brief  de¬ 
scription  of  the  important  contributions  which  relate  directly  to  the  problem 
addressed  by  this  thesis:  discovering  a  set  of  categories  that  are  useful  for 
recognition  in  terms  of  permitting  reliable  inferences  about  an  object’s  prop¬ 
erties.  The  reader  is  referred  to  [Smith  and  Medin,  1981],  [Anderberg,  1973], 
and  [Michalski,  et  al.,  1983;  Michalski,  et  al.,  1986]  for  references  giving  more 
detailed  analyses. 

3.1 Cognitive Science

In  section  2.5.1  we  referred  to  the  work  on  basic  level  categories  as  evidence 
for  the  existence  of  natural  categories  in  human  mental  representation;  we 
now  consider  that  work  in  terms  of  its  theoretical  development.  Cognitive 
scientists  have  attempted  to  formalize  the  definition  of  basic  level  categories 
in terms of features and their distributions. Because these categories display
the  desirable  properties  discussed  in  chapter  2  —  they  are  highly  structured 
permitting  many  inferences  to  be  made  about  the  properties  of  objects  con¬ 
tained  in  those  categories,  and  they  are  quite  dissimilar  to  one  another  mak¬ 
ing  classification  more  reliable  —  it  is  important  to  understand  these  prior 
attempts  to  specify  basic  categories.  We  will  then  draw  upon  some  of  them 
in  our  own  development  of  a  categorization  metric  in  chapter  4. 

In the original work of Rosch, et al. [1976] basic level categories are
described  as  the  taxonomic  level  which  maximized  the  cue  validity  of  a  cat¬ 
egory.  As  used  by  Rosch,  et  al.,  the  cue  validity  of  a  feature  for  a  category 
is  a  psychological  quantity  which  measures  how  indicative  a  certain  feature 
would  be  of  some  category.  The  cue  validity  of  a  category  is  defined  to  be 
the  sum  of  the  cue  validities  of  the  various  features  or  attributes  true  of  the 
objects  in  that  category.  For  example,  the  cue  validity  of  feathers  would  be 
very  high  for  the  category  “birds,”  but  less  so  for  “ducks”  since  many  ob¬ 
jects  which  have  feathers  are  not  ducks.  Likewise  for  the  features  “wings”, 
“beaks”,  and  “lays  eggs”.  To  consider  whether  basic  level  categories  can  be 
defined  in  terms  of  cue  validity  we  need  to  provide  a  formal  description  of 
that  psychological  quantity. 

The most common formal definition of cue validity is that of conditional
probability.1 That is, the cue validity of some feature $f_i$ for a category $C_j$ is


1Unfortunately, the term “cue validity” has more than one formal definition in the cognitive
science literature. Conditional probability is the interpretation taken by Smith and Medin
[1981] and Murphy [1982], though the formulation provided by Smith and Medin (p. 79) is
mathematically incorrect. The cue validity to which Rosch, et al. refer is probably based
upon the definition provided by Beach [1964] and Reed [1972]. In their formulation, the
cue validity of a feature for a category is calculated by considering both the frequency of
occurrence of a feature (averaged over all categories) and its diagnostic value in identifying
that category. Let $p$ be inversely proportional to the over-all frequency of occurrence
of some feature $f_i$. Then, in this formulation, the cue validity of feature $f_i$ for some
category $C_j$ is equal to $p \cdot P(C_j) + (1 - p) \cdot P(C_j \mid f_i)$, where $P(C_j)$ is the
prior probability of $C_j$ and $P(C_j \mid f_i)$ is the conditional probability of $C_j$ given $f_i$.
This formulation was provided to explain the psychological phenomenon that subjects


taken to be the conditional probability that some object is a member of $C_j$ if
it is known that feature $f_i$ is true of that object. (We assume for now that a
feature is either true or false for any given object.) A simple formula for the
conditional probability can be given in terms of the frequency of occurrences
of a feature in different categories. Let us assume that $N_a$ is the number of
objects in the category $C_a$ for which some feature $f_i$ is true. Likewise for
$N_b$, and for now we can assume that there are only two categories. Then,
if we assume that the number of occurrences can be used to estimate the
underlying probability distributions, simple probability theory yields:

$$\text{cue validity of } f_i \text{ for category } C_a = P(C_a \mid f_i) = \frac{N_a}{N_a + N_b}$$

If  there  are  more  than  two  categories,  then  the  additional  occurrences  of 
the  feature  in  those  categories  are  simply  added  to  the  denominator.  The 
denominator  is  simply  the  total  number  of  objects  in  the  world  exhibiting  the 
feature;  it  remains  constant  regardless  of  the  number  or  type  of  categories 
into  which  the  world  is  partitioned.  With  this  definition  in  hand,  we  can 
now  consider  whether  basic  level  categories  can  indeed  be  defined  in  terms 
of  cue  validity. 

As  Murphy  [1982]  has  noted,  maximizing  cue  validity  cannot  be  the  basis 
for  basic  level  categories.  A  simple  example  will  quickly  demonstrate  this 
fact. Following Murphy, let us consider the taxonomic hierarchy of “physical-object”,
“animal”, “bird”, and “duck”, and let us examine the cue validity of
the feature “has-wings” with respect to this hierarchy. Again, define $N_{phys}$
to be the number of “physical-objects” for which the feature “has-wings” is
true. Similarly for $N_{animal}$, $N_{bird}$, and $N_{duck}$. By definition,
$N_{phys} \geq N_{animal} \geq N_{bird} \geq N_{duck}$. Therefore, since the denominator in the expression for cue
validity  remains  constant  regardless  of  the  partitioning  of  the  objects,  it  must 
also  be  the  case  that  the  cue  validity  for  “has-wings”  increases  as  one  moves 
up  the  taxonomic  hierarchy.  This  agrees  with  the  intuition  that  if  p  is  the 
probability  that  some  object  is  a  “bird”  given  that  one  knows  some  feature 
about  that  object,  then  the  probability  that  it  is  an  “animal”  should  be  at 
least p. Since the cue validity of any feature for a category increases as the
category becomes more inclusive, the most inclusive category would be the
level  which  maximized  total  cue  validity.  The  basic  level  categories  would 
possess  lower  cue  validities  than  the  most  general  category  “thing.” 
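
The monotonic growth of cue validity up the hierarchy can be checked numerically. The following Python sketch uses hypothetical counts; the numbers, the dictionary `winged`, and the function name `cue_validity` are illustrative assumptions and are not taken from Murphy. It estimates the conditional probability of a category given the feature from occurrence counts, as in the formula above.

```python
def cue_validity(n_in_category, n_total_with_feature):
    """Estimate P(category | feature) from occurrence counts."""
    return n_in_category / n_total_with_feature

# Hypothetical counts of objects exhibiting "has-wings" at each nested level
# of the hierarchy; each level includes all members of the level below it.
winged = {"duck": 10, "bird": 90, "animal": 100, "physical-object": 120}

# The denominator is the number of winged objects in the whole world,
# which does not depend on how the world is partitioned.
total_winged = winged["physical-object"]

validities = {level: cue_validity(n, total_winged) for level, n in winged.items()}

for level, value in validities.items():
    print(f"{level:16s} {value:.3f}")
```

Under these assumed counts the value never decreases as the category becomes more inclusive, so a rule that maximizes cue validity alone would always select the most general category.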

The  underlying  reason  that  cue  validity  cannot  be  used  to  define  basic 
level  categories  is  that  cue  validity  contains  no  consideration  of  the  density  of 
a  feature  within  a  category.  That  is,  only  the  conditional  probability  of  the 
category  given  the  feature  is  measured,  ignoring  the  likelihood  that  a  given 
category  member  contains  that  feature.  This  extra  component  is  included  in 
the  collocation  measure  proposed  by  Jones  [1983].  Using  the  above  notation, 
the collocation of a category and a feature, $K_{C_j,f_i}$, is defined by:

$$K_{C_j,f_i} = P(C_j \mid f_i) \cdot P(f_i \mid C_j)$$

The  first  term  is  the  conditional  probability  corresponding  to  the  cue  va¬ 
lidity  discussed  above.  The  second  term,  however,  reflects  the  density  of  a 
feature  in  the  category.  Because  the  collocation  is  a  product  of  these  two 
probabilities, its value can be large only when both terms are large. Though
the  first  term  (cue  validity)  grows  as  categories  become  more  inclusive,  the 
second  term  is  diminished  when  a  category  becomes  less  homogeneous.  Thus 
the  maximum  of  this  function  will  occur  at  some  intermediate  depth  in  the 
taxonomic  hierarchy.  Jones  argues  that  the  basic  level  categories  occur  at 
the taxonomic level which tends to maximize the collocation as measured over
all  the  features. 

A  simple  example  will  illustrate  the  properties  of  the  collocation  measure 
and  how  it  relates  to  basic  level  categories.  After  Jones  [1983],  suppose 
we have the feature can-fly and the hierarchy “duck”, “bird”, “animal”.
Suppose  there  are  10  instances  of  “duck”,  all  of  which  can  fly,  90  other 
instances  of  “bird”,  80  of  which  can  fly  (allowing  for  some  non-flying  birds) 
and  900  additional  instances  of  “animal”  of  which  10  can  fly  (for  animals 
such  as  bats).  If  we  assume  that  the  occurrences  can  be  used  to  estimate  the 
probabilities, then we can compute the following collocations: $K_{duck,\,can\text{-}fly} = .10$,
$K_{bird,\,can\text{-}fly} = .81$, $K_{animal,\,can\text{-}fly} = .10$. Thus, the collocation measure
attains  a  maximum  at  the  basic  level  (“bird”). 
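
The three collocation values can be reproduced directly from the counts given in the example. The Python sketch below is only an illustration; the function name `collocation` and the dictionary layout are our own assumptions, and probabilities are estimated from raw counts as the text proposes.

```python
def collocation(n_feature_in_category, n_category, n_feature_total):
    """K = P(category | feature) * P(feature | category), estimated from counts."""
    p_category_given_feature = n_feature_in_category / n_feature_total
    p_feature_given_category = n_feature_in_category / n_category
    return p_category_given_feature * p_feature_given_category

# (category size, members that can fly), accumulated up the hierarchy:
# 10 ducks (10 fly); 10 + 90 birds (10 + 80 fly); 100 + 900 animals (90 + 10 fly).
levels = {"duck": (10, 10), "bird": (100, 90), "animal": (1000, 100)}
total_flyers = levels["animal"][1]  # every flying object in this example is an animal

K = {name: collocation(flyers, size, total_flyers)
     for name, (size, flyers) in levels.items()}
most_basic = max(K, key=K.get)

for name, value in K.items():
    print(f"{name:7s} {value:.2f}")
```

For “duck” the first factor is small (few flying things are ducks) while the second is 1; for “animal” the situation is reversed; only “bird” keeps both factors large.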

Jones  [1983]  proposed  a  particular  method  for  converting  raw  collocation 
measures  into  an  index  measuring  the  degree  to  which  a  category  is  basic. 
This  construction  can  only  evaluate  one  category  with  respect  to  the  other 
categories  of  some  categorization.  It  does  not  readily  permit  one  to  compare 
one  set  of  categories  to  another,  making  it  inadequate  for  the  task  of  selecting 
an appropriate set of categories. Also, there is a question as to whether the
“degree to which a category is basic” is a meaningful quantity. However, the
basic principle of combining two terms reflecting the diagnosticity of features
and  the  homogeneity  of  categories  (here  expressed  as  conditional  probabil¬ 
ities)  is  consistent  with  the  goals  of  categorization  proposed  in  chapter  2. 
A  high  degree  of  homogeneity  in  a  category  permits  an  observer  to  infer 
features  (properties)  of  an  object  once  its  category  is  known,  and  a  high  di¬ 
agnosticity  of  a  feature  for  some  category  makes  correct  categorization  easier 
and  more  reliable.  In  chapter  4,  where  we  develop  a  measure  of  the  utility  of 
a  set  of  categories  for  recognition,  we  will  return  to  this  discussion  of  these 
two,  in  some  sense  opposing,  goals. 

It  should  be  noted  that  in  terms  of  being  useful  for  our  purposes  of  creat¬ 
ing  a  categorization  which  is  suitable  for  recognition,  there  is  a  fundamental 
difficulty  with  the  collocation  measure:  the  relative  weights  of  the  two  prob¬ 
abilities  are  arbitrarily  set  to  be  equal.  To  examine  this  issue  more  closely, 
consider  the  following  modified  version  of  collocation: 

$$K'_{C_j,f_i} = P(C_j \mid f_i)^{\lambda} \cdot P(f_i \mid C_j)^{(1-\lambda)}$$

In $K'$, the exponent $\lambda$ ($0 < \lambda < 1$) reflects the relative contribution of the
ability to infer an object’s category given its features (expressed as the conditional
probability $P(C_j \mid f_i)$) as compared to the ability to predict an object’s
features given its category ($P(f_i \mid C_j)$). Such a relative weight is necessary if
we  are  to  use  this  measure  to  help  select  a  categorization  appropriate  for 
recognition. The observer needs to be able to trade off how much information
about an object he needs to infer from the category against how difficult
it  is  to  identify  an  object’s  category  from  its  features.  Without  such  a  pa¬ 
rameter,  the  categories  that  the  collocation  measure  will  select  as  basic  or 
fundamental  will  be  completely  determined  by  the  distribution  of  features 
which  the  observer  measures;  the  goals  of  the  observer  cannot  be  used  to 
constrain  the  selected  categories.  In  chapter  4  we  will  introduce  an  explicit 
parameter  which  represents  this  trade-off. 
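
To make the role of the relative weight concrete, the Python sketch below applies the modified collocation to the probabilities from the can-fly example of Jones. The three weight values and the helper names `weighted_collocation` and `preferred_level` are illustrative assumptions chosen for this sketch, not values proposed in the text.

```python
def weighted_collocation(p_category_given_feature, p_feature_given_category, lam):
    """K' = P(C|f)**lam * P(f|C)**(1 - lam), with 0 < lam < 1."""
    return (p_category_given_feature ** lam) * (p_feature_given_category ** (1.0 - lam))

# (P(category | can-fly), P(can-fly | category)) from the counts in the example.
probabilities = {
    "duck": (0.10, 1.00),
    "bird": (0.90, 0.90),
    "animal": (1.00, 0.10),
}

def preferred_level(lam):
    """Return the taxonomic level with the largest weighted collocation."""
    scores = {name: weighted_collocation(p_cf, p_fc, lam)
              for name, (p_cf, p_fc) in probabilities.items()}
    return max(scores, key=scores.get)

for lam in (0.02, 0.50, 0.98):
    print(lam, preferred_level(lam))
```

With a weight near 0 the predictability of features dominates and the narrow category “duck” is preferred; with a weight near 1 the ease of identifying the category dominates and “animal” is preferred; equal weights recover “bird”. The selected level thus depends on the goals of the observer and not only on the feature distribution.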

Finally,  we  should  mention  the  work  of  Tversky  [1977].  In  that  seminal 
paper,  Tversky  constructs  a  contrast  model  of  similarity;  it  is  so  termed  be¬ 
cause  the  similarity  between  two  objects  depends  on  not  only  the  features 
they  have  in  common,  but  also  the  (contrast)  features  they  do  not.  By 
further  introducing  an  asymmetry  in  the  manner  in  which  two  objects  are 
compared,  Tversky  is  able  to  explain  the  empirical  finding  that  similarity 
is not psychologically symmetric. For example, most subjects rated North
Korea as more similar to China than China was to North Korea. (Remember
that these data were recorded in 1976!) The aspect of the contrast model
theory which is relevant to our discussion is Tversky’s proposal of using his
similarity metric as a measure sensitive to the basic level categories. Tversky
realized that a measure which only considered the similarity between
members  within  a  category  would  select  minimal  categories  (the  opposite  of 
cue  validity);  categories  would  tend  to  become  more  homogeneous  as  they 
were  refined.  To  compensate  for  this  deficiency  Tversky  suggested  creating 
a  measure  to  select  basic  categories  by  multiplying  the  average  similarity 
between  objects  in  a  category  by  a  weighting  factor  which  increased  with 
category  size.  This  product  would  then  behave  in  a  fashion  somewhat  anal¬ 
ogous  to  that  of  collocation.  However,  a  weight  based  upon  category  size 
must  be  viewed  as  an  ad  hoc  solution;  the  number  of  objects  contained  in  a 
category  should  not  determine  whether  that  category  is  at  the  basic  level  in 
a  taxonomy. 

We  should  note  that  cognitive  science  has  not  addressed  the  question  of 
how  basic  level  categories  are  acquired.  That  is,  even  if  one  has  a  measure 
which  is  sensitive  to  the  basic  level  of  a  taxonomy,  one  cannot  recover  the 
basic  level  categories  unless  a  taxonomy  is  provided.  Arguing  that  a  taxon¬ 
omy  is  provided  through  instruction  (objects  are  placed  in  a  hierarchy  by 
teachers)  seems  to  be  an  untenable  position;  otherwise,  one  would  have  to 
believe  that  in  the  absence  of  instruction  basic  level  categories  would  not  be 
formed.  Also,  the  fact  that  animals  form  “natural”  categories  about  objects 
with  which  they  (and  their  ancestors  of  50  million  years)  have  had  no  expe¬ 
rience  [Hernstein  and  de  Villiers,  1980]  argues  against  taxonomies  provided 
by  instruction. 

In  summary,  we  can  conclude  that  a  measure  which  is  sensitive  to  basic 
level  categories  must  contain  at  least  two  components.  These  components 
should  reflect  not  only  the  similarity  within  a  category,  but  also  the  dissim¬ 
ilarity  between  categories.  (In  the  next  section  we  will  see  that  these  two 
components  are  key  to  many  cluster  analysis  algorithms.)  However,  provid¬ 
ing  a  measure  which  can  indicate  basic  or  natural  categories  is  only  part  of 
the  categorization  problem.  The  issue  of  how  one  discovers  these  categories, 
of  how  one  hypothesizes  about  which  categories  are  natural,  must  also  be 
addressed. 

3.2 Cluster Analysis

One  of  the  common  problems  encountered  in  science  is  that  of  generating  a 
reasonable  interpretation  and  explanation  of  a  collection  of  data.  Whether 
the  data  consist  of  various  astronomical  recordings  [Wishart,  1969]  or  of  the 
descriptions  of  several  diseased  soy  bean  plants  [Michalski  and  Stepp,  1983b], 
a  basic  step  in  the  analysis  of  the  data  is  to  appropriately  group  the  data 
into  meaningful  pieces.  In  the  case  of  the  astronomical  data,  the  spectral- 
luminosity  profiles  are  grouped  in  such  a  way  so  as  to  identify  four  classes  of 
stars: giants, super-giants, main sequence stars, and dwarfs. This grouping
process  —  segmenting  the  data  into  classes  which  share  some  underlying 
process  —  is  often  the  most  important  and  yet  the  most  difficult  step  in  any 
experimental  science. 

Cluster  analysis  (sometimes  referred  to  as  unsupervised  learning)  is  the 
study  of  automated  procedures  for  discovering  important  classes  within  a  set 
of data. Traditionally, data are represented as points in some d-dimensional
feature  space,  where  each  dimension  is  some  ordinal  scale.  Such  a  represen¬ 
tation  allows  one  to  construct  various  distance  metrics,  and  then  to  use  those 
metrics  to  define  “good  clusters.”  Algorithms  are  then  developed  to  discover 
such clusters in the data. We present a brief analysis of common metrics
and methods used in cluster analysis, and will relate these comments to our
current question of object categorization for recognition. The presentation
here is drawn in part from Duda and Hart [1973] and Hand [1981].

3.2.1  Distance  metrics 

At  the  heart  of  every  clustering  algorithm  lies  a  distance  metric  which  defines 
the  distance  between  any  two  data  points.  Most  of  these  metrics  require 
that the data be represented as points in a d-dimensional space, and that
distances along each dimension be well defined.2 Standard numerical axes are

2Some  approaches  to  cluster  analysis  have  defined  distance  metrics  on  representations 
which  use  binary  (as  opposed  to  ordinal)  dimensions  (see  for  example  Jardine  and  Sibson 
[1971]).  The  distance  between  two  objects  is  defined  to  be  the  Hamming  distance:  the 
number  of  dimensions  on  which  the  objects  take  different  values.  The  similarity  between 
two  objects  —  the  logical  inverse  of  distance  —  is  referred  to  as  the  matching  coefficient. 
These  metrics,  however,  have  difficulties  similar  to  those  associated  with  traditional  dis¬ 
tance  metrics  (see  text).  Problems  of  scale  become  problems  of  resolution  and  relative 




typically used in real applications [Hand, 1981]. The notation we will adopt
is that each object or data point is represented by a vector $\mathbf{x} = (x_1, x_2, \ldots, x_d)$,
where $x_i$ is the value for $\mathbf{x}$ in the $i$th dimension.

An  important  question  in  designing  a  distance  metric  for  such  a  system  is 
whether  the  measure  should  be  scale  invariant.  For  example,  one  can  assume 
that  the  values  along  each  dimension  are  normally  distributed;  scaling  would 
then  consist  of  a  linear  transformation  of  each  dimension  to  yield  unit  vari¬ 
ances.  The  difficulty  in  deciding  whether  such  scaling  should  be  performed  is 
illustrated in Figure 3.1. Here, a rescaling of the dimensions changes
the  apparent  clusters.  If  a  measure  were  scale  invariant,  it  would  not  be  able 
to  detect  the  differences  between  these  two  data  distributions.  Whether  this 
behavior  is  desirable  depends  on  the  domain  and  the  semantics  of  each  of  the 
dimensions.  That  is,  one  cannot  decide  on  the  basis  of  the  data  alone  whether 
scaling  is  appropriate.  This  requirement  for  outside  information  is  similar 
to  Watanabe’s  argument  against  natural  categories  presented  in  chapter  2: 
knowledge  of  which  features  are  important  cannot  be  determined  by  looking 
only  at  the  data  itself  without  additional  information  being  provided. 
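
The sensitivity of a distance-based grouping to the choice of scale can be seen with three points. The Python sketch below uses hypothetical measurements; the coordinates, the scale factor, and the helper names are assumptions made only for illustration. It shows that the nearest neighbor of a point can change when one dimension is expressed in different units.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def nearest(target, others):
    """Return the label of the point in `others` closest to `target`."""
    return min(others, key=lambda label: euclidean(target, others[label]))

def rescale(point, factors):
    """Multiply each coordinate by the scale factor of its dimension."""
    return tuple(v * s for v, s in zip(point, factors))

# Hypothetical two-dimensional measurements.
a = (0.0, 0.0)
others = {"b": (1.0, 0.0), "c": (0.0, 2.0)}

before = nearest(a, others)  # b is 1 unit away, c is 2 units away

# Express the first dimension in units ten times smaller.
factors = (10.0, 1.0)
after = nearest(rescale(a, factors),
                {label: rescale(p, factors) for label, p in others.items()})

print(before, after)
```

Nothing in the data indicates which of the two unit choices is the right one, which is the point made above: the decision requires knowledge about the meaning of each dimension.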

Ignoring  the  issue  of  scaling,  we  can  consider  several  distance  metrics 
which  assume  that  the  dimensions  are  appropriately  scaled.3  One  common 
distance  metric  is  the  usual  Euclidean  metric: 

$$d_2(\mathbf{x},\mathbf{y}) = \left[ \sum_{i=1}^{d} (x_i - y_i)^2 \right]^{1/2}$$

By using the Euclidean measure, one is making the assumption that different
dimensions  are  compatible  and  that  distances  along  a  diagonal  are  equivalent 
to  distances  along  a  dimension.  Often,  such  an  assumption  is  unreasonable: 
combining  years-of-education  and  height  yields  no  meaningful  quantity.  In 
such  cases,  the  city  block  metric  is  more  appropriate: 

$$d_1(\mathbf{x},\mathbf{y}) = \sum_{i=1}^{d} |x_i - y_i|$$


In  this  case  the  dimensions  are  weighted  equally,  but  no  interpretation  is 
given  to  the  interaction  between  dimensions. 

importance;  other  issues  concerning  the  use  of  such  metrics  remain  the  same. 

3As  such  we  will  not  describe  such  classic  measures  as  Mahalanobis's  distance,  which 
assumes  the  data  are  sampled  from  a  multi-variate  normal  distribution  and  scales  the 
data  by  the  inverse  of  the  estimated  cross  correlation  matrix. 


Both $d_1$ and $d_2$ are special cases of the general Minkowski distance:

$$d_3(\mathbf{x},\mathbf{y}) = \left[ \sum_{i=1}^{d} |x_i - y_i|^r \right]^{1/r}$$

By varying $r$ one can control the degree of interaction between dimensions.
When $r = 1$, the Minkowski measure equals the city block metric; $r = 2$
yields the Euclidean measure. As $r \to \infty$, the Minkowski distance converges
to

$$d_4(\mathbf{x},\mathbf{y}) = \max_i |x_i - y_i|$$

which represents “total” interaction: the dimension along which two data
points differ the most completely dominates the other dimensions.
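
The relationship among these metrics can be verified on a single pair of points. The Python sketch below is a minimal illustration; the function names and the sample points are our own assumptions. It evaluates the Minkowski distance for r = 1, r = 2, and a large r, and compares the last value with the maximum coordinate difference.

```python
def minkowski(x, y, r):
    """General Minkowski distance with exponent r >= 1."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)

def max_distance(x, y):
    """Limit of the Minkowski distance as r grows without bound."""
    return max(abs(a - b) for a, b in zip(x, y))

x = (0.0, 0.0)
y = (3.0, 4.0)

city_block = minkowski(x, y, 1)   # |3| + |4|
euclidean = minkowski(x, y, 2)    # sqrt(9 + 16)
large_r = minkowski(x, y, 50)     # already close to the largest difference
limit = max_distance(x, y)

print(city_block, euclidean, large_r, limit)
```

As r increases, the smaller coordinate difference contributes less and less, until only the dimension with the largest difference matters.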

There  are  two  fundamental  assumptions  in  distance  metrics  such  as  these 
whose  validity  is  questionable  if  the  task  is  one  of  categorizing  objects  into 
categories  suitable  for  recognition.  The  first  of  these  is  the  assumption  that 
there  are  no  (or  at  most  few)  dimensions  that  are  unconstrained  in  the  data. 
If  there  are  many  such  dimensions,  then  the  distances  between  objects  in 
these  dimensions  will  act  as  noise,  making  it  difficult  to  detect  the  important 
distances  along  the  constrained  dimensions.  When  attempting  to  categorize 
objects  for  recognition,  the  important  properties  —  properties  which  are 
indicative  of  an  object’s  category  —  are  as  yet  unknown.  Thus,  it  is  likely 
that  some  of  the  properties  measured  will  be  unconstrained  in  the  objects. 
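
A small constructed case shows how unconstrained dimensions can swamp the constrained one. In the Python sketch below, which is a deliberately contrived illustration with hypothetical objects and names, the first coordinate is the only property tied to category membership, and k further coordinates are unconstrained.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def distances(k):
    """Return (within-category, between-category) distances
    when k unconstrained dimensions are measured."""
    a1 = (0.0,) + (0.0,) * k  # category A
    a2 = (0.0,) + (1.0,) * k  # category A, differs on every unconstrained dimension
    b1 = (1.0,) + (0.0,) * k  # category B, happens to match a1 on those dimensions
    return euclidean(a1, a2), euclidean(a1, b1)

results = {k: distances(k) for k in (0, 1, 4, 16)}

for k, (within, between) in results.items():
    print(k, within, between)
```

With no unconstrained dimensions the two members of category A coincide; with four or more, they are farther from each other than either is from the member of category B, even though the constrained property separates the categories perfectly.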

The  second  basic  assumption  is  that  the  same  distance  metric  is  appli¬ 
cable  throughout  all  of  feature  space.  Normally,  these  distance  metrics  are 
insensitive  to  the  absolute  values  of  the  feature  vectors  being  compared;  the 
distances  between  data  points  are  determined  solely  by  the  differences  along 
each  dimension.  Thus,  these  metrics  do  not  alter  their  behavior  as  a  function 
of  a  feature  vector’s  position  in  feature  space.  With  respect  to  categorization, 
this assumption requires that the properties that are important for measuring
the distance between some particular pair of objects must be important
for  all  pairs  of  objects.  This  requirement  does  not  seem  reasonable  for  a 
world  in  which  the  constrained  properties  of  objects  vary  from  one  object 
type  to  another. 

Finally,  most  clustering  algorithms  require  being  able  to  specify  not  only 
the  distance  between  two  data  points,  but  also  the  distance  between  a  data 
point  and  a  cluster  of  data  points;  the  distance  between  two  clusters  is  often 
required  as  well.  Because  the  measure  of  distance  between  clusters  is  often 
constrained by the algorithm used for discovering clusters, we will present
the  inter-cluster  measures  in  those  sections  discussing  clustering  methods. 

3.2.2  Hierarchical  methods 

Most cluster analysis programs can be described as being one of two types
of  algorithms,  or  as  being  a  hybrid  of  the  two.  The  first  of  these  consists  of 
hierarchical  methods  which  automatically  produce  a  taxonomy  of  the  data 
samples.  In  divisive  clustering,  the  taxonomy  is  constructed  by  starting  with 
all  data  points  belonging  to  a  single  cluster  and  then  splitting  clusters  until 
each object is its own class. Agglomerative methods begin with each sample
as  a  separate  cluster  of  size  one  and  then  merge  classes  until  all  samples 
are  in  one  category.  Since  similar  issues  underlie  both  techniques  we  will 
consider  only  the  agglomerative  methods.  Our  discussion  will  follow  that  of 
Duda  and  Hart  [1973]. 

For this discussion, c represents the number of clusters; X_i is the ith
cluster, a set of data points; x_j is the jth data point, represented as a feature
vector; n is the number of data points. The basic algorithm for agglomerative
clustering can be written as a simple iteration loop:

1.  Let c = n and X_i = {x_i}, i = 1, ..., n.

2.  If c = 1, stop.

3.  Find the nearest pair of distinct clusters, say X_i and X_j.

4.  Merge X_i and X_j into a new X_i, delete X_j, and decrement c.

5.  Go to step 2.

When executed, this procedure produces a dendrogram such as that in Figure 3.2.
The vertical axis measures the distance between the clusters as they
are merged. At a distance of 1, objects C and D were combined to form a
new cluster; likewise at a distance of 1.5, A and B were combined. Finally at
a distance of 2.5, the cluster {A, B} and the cluster {C, D} were combined
to  yield  a  single  cluster.  In  a  moment,  we  will  show  that  the  dendrogram 
does  not  always  yield  a  tree  structure  which  is  consistent  with  increasing 
distance. 
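
The iteration loop above can be sketched in a few lines. This is an illustrative sketch, not the procedure of any particular program: it assumes one-dimensional data and a nearest-neighbor distance between clusters, and the coordinates for A, B, C, and D are hypothetical values chosen only so that the merges occur at the distances quoted for the dendrogram.

```python
from itertools import combinations

def single_link(p, q):
    """Nearest-neighbor distance between two clusters of 1-D points (an assumed choice)."""
    return min(abs(a - b) for a in p for b in q)

def agglomerate(points, cluster_distance=single_link):
    """Merge the nearest pair of clusters until one cluster remains.

    Returns the merge history as a list of (distance, merged_cluster) tuples.
    """
    clusters = [(p,) for p in points]            # step 1: one cluster per data point
    history = []
    while len(clusters) > 1:                     # step 2: stop when c = 1
        i, j = min(                              # step 3: nearest pair of distinct clusters
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        d = cluster_distance(clusters[i], clusters[j])
        merged = clusters[i] + clusters[j]       # step 4: merge, which reduces c by one
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        history.append((d, merged))
    return history

# Hypothetical coordinates reproducing merge distances of 1, 1.5, and 2.5.
A, B, C, D = 0.0, 1.5, 4.0, 5.0
history = agglomerate([A, B, C, D])
merge_distances = [d for d, _ in history]
```

The list of merge distances is exactly the information plotted on the vertical axis of a dendrogram.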
To execute the agglomerative procedure, one must define the distance
between two clusters. Many measures have been proposed, but we can separate
them into those which find a minimal or maximal distance between all
possible pairings of objects in the two clusters, and those which compute some
average distance. An example of the former group is the nearest-neighbor
metric in which the distance between two clusters is defined to be the distance
between the two closest points, one from each cluster. Because of the
ability  of  a  single  data  point  to  dramatically  affect  the  distance  between  two 
clusters,  this  class  of  measures  exhibits  the  undesirable  behavior  of  being 
sensitive  to  outlying  or  “maverick”  members  in  a  cluster. 

To  remove  this  undesirable  behavior,  measures  based  upon  either  average 
distances or the distances between average members are used. However, these
metrics  can  cause  clusters  to  be  formed  that  are  “closer”  to  each  other  than 
the  sub-clusters  from  which  they  were  originally  formed.  An  example  of 
this  is  shown  in  Figure  3.3a.  In  this  case  we  assume  that  the  distance  metric 
used  between  two  clusters  is  the  Euclidean  distance  between  the  (arithmetic) 
average  of  each  cluster.  In  this  example,  data  point  A  is  merged  with  data 
point  B  because  they  are  the  closest  pair;  the  distance  between  them  is  2.2 
units.  (Note  that  A,  B,  and  C  could  be  the  average  of  previous  clusters 
found  as  opposed  to  being  single  data  points.)  Next,  data  point  C  is  merged 
with  the  new  cluster  {A,  B}  as  they  are  the  only  remaining  two  clusters. 
But, the distance between these two clusters is only 2.0 units, less than the
original  distance  between  A  and  B.  Thus,  the  dendrogram  displaying  this 
agglomerative  clustering  might  be  drawn  as  in  Figure  3.3b;  the  taxonomy  is 
no  longer  consistent  with  distance. 
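
The inversion can be reproduced with a short calculation. The coordinates below are hypothetical (the figure's actual positions are not given in the text); they are chosen only to be consistent with the quoted distances of 2.2 and 2.0 units.

```python
import math

def centroid(cluster):
    """Arithmetic mean of a list of equal-length points."""
    n = len(cluster)
    return tuple(sum(p[k] for p in cluster) / n for k in range(len(cluster[0])))

def centroid_distance(p, q):
    """Euclidean distance between the arithmetic means of two clusters."""
    return math.dist(centroid(p), centroid(q))

# Hypothetical positions: A and B are 2.2 apart, and C sits above their midpoint.
A, B, C = (0.0, 0.0), (2.2, 0.0), (1.1, 2.0)

first_merge = centroid_distance([A], [B])        # A and B are the closest pair
a_to_c = centroid_distance([A], [C])             # farther than A to B
b_to_c = centroid_distance([B], [C])             # farther than A to B
second_merge = centroid_distance([A, B], [C])    # smaller than the first merge distance
```

Because C is compared against the mean of {A, B} rather than against A or B individually, the second merge occurs at a smaller distance than the first.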

For the task of partitioning objects into categories suitable for recognition,
hierarchical methods have a serious deficiency: they require that the complete
data set be present at the start of the procedure. The addition of a new
data  point  can  radically  alter  the  structure  of  the  dendrogram  by  providing  a 
new  link  between  previously  separated  clusters.  This  is  especially  a  problem 
for  methods  which  rely  on  an  inter-cluster  distance  metric  such  as  nearest 
neighbor.  Such  a  system  must  recompute  the  entire  dendrogram  when  new 
data are observed. Because the observer in the natural world will often encounter
new objects, a hierarchical approach would not be appropriate for
creating  natural  object  categories. 

Figure 3.3: (a) Data poin...
indicated. The distance b...

Figure 3.4: A hypothetical dendrogram. If there is some physical significance
to the distance measure, one could infer that these data were generated by several
discrete processes. In particular, while a description of the data as having 6 groups
or 2 groups seems reasonable, a description which claimed there were 5 groups
present seems arbitrary. This requirement that the description of the clusters be
stable with respect to the distance metric is analogous to Witkin's discussion of
the scale-space description of image intensity profiles [Witkin, 1983].

We conclude our description of hierarchical methods by commenting on
the utility of the dendrogram. Suppose one were to hierarchically cluster
some  data  and  that  the  resulting  dendrogram  was  that  of  Figure  3.4.  Notice 
that  as  the  distance  between  clusters  is  increased  the  objects  are  quickly 
clustered  into  6  groups.  Then,  after  increasing  the  distance  sufficiently,  3  of 
the groups are merged in quick succession while the other clusters remain
separate.  The  process  repeats  for  the  other  set  of  3  clusters.  Eventually,  the  2 
clusters  are  combined  to  form  one  global  category.  An  intuitive  interpretation 
of  such  a  dendrogram  is  that  there  are  discrete  processes  reflected  in  the  data 
and  that  any  valid  category  description  would  reflect  these  processes.  For 
example, description D2 which partitions the objects into 5 categories seems
less valid than either description D1 or D3 because of the sensitivity of D2
to the distance metric. If one thinks of the distance metric as the scale at
which the data are observed, then D1 and D3 are stable with respect to small
changes in that scale, whereas D2 is not. Zahn [1971] used a similar principle
in recovering clusters by dividing a minimal spanning tree graph of the data
at edges whose lengths were inconsistent with the other edges in the tree.
The  notion  of  a  description  being  stable  with  respect  to  a  scale  parameter 
is  reminiscent  of  Witkin’s  [1983]  scale  space  description  of  image  intensity 
profiles.  In  chapters  4  and  6  we  will  return  to  this  question  of  stability  of 
description  with  respect  to  the  “scale”  of  the  observation. 
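
One way to make this notion of stability concrete is to measure, for each number of clusters, the width of the interval of distances over which that number of clusters persists. The sketch below is only an illustration: the merge heights are hypothetical values patterned on the verbal description of the dendrogram, and the cutoff of 1.0 used to call a description stable is an arbitrary assumption.

```python
def lifetimes(merge_heights, n_start, base=0.0):
    """Width of the distance interval over which each cluster count persists.

    `n_start` clusters exist at distance `base`; each merge height removes one cluster.
    """
    spans = {}
    previous, c = base, n_start
    for h in sorted(merge_heights):
        spans[c] = h - previous      # c clusters exist for distances in [previous, h)
        previous, c = h, c - 1
    return spans

# Hypothetical heights: six groups exist by distance 1.0; three of them merge near 5.0,
# the other three merge near 5.3, and the two resulting clusters merge at 9.0.
spans = lifetimes([5.0, 5.1, 5.3, 5.4, 9.0], n_start=6, base=1.0)
stable = sorted(c for c, width in spans.items() if width > 1.0)
```

Under these assumed heights the descriptions with 6 and with 2 clusters persist over wide intervals, while the description with 5 clusters exists only over a narrow range of distances.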

3.2.3  Optimization  methods 

The  second  basic  approach  to  cluster  analysis  is  category  optimization.  In 
these  methods,  one  assumes  that  there  exists  some  known  number  of  classes 
c.  The  data  are  first  partitioned  into  c  classes  (either  randomly  or  by  some 
hierarchical  method),  and  then  some  suitable  clustering  criterion  is  optimized 
by  transferring  data  samples  from  one  cluster  to  another.  An  example  of  such 
a method is the k-means method which can be written as (following Duda
and  Hart  [1973]): 

1.  Choose some initial values for means μ_1, ..., μ_c.

2.  Classify the n samples by assigning them to the class of the closest
mean. (This is equivalent to clustering the objects to minimize the
sum of the squares of the distances of the data points to the cluster
means μ_i.)

3.  Recompute the means μ_1, ..., μ_c as the average of the samples in their
class.

4.  If any mean changed value go to Step 2; else STOP.

Each  iteration  improves  some  measure  (in  this  case  the  sum  of  the  squared 
distances  from  the  data  points  to  the  cluster  means)  of  the  “goodness”  of 
the  clusters. 
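
The four steps can be sketched directly. This is a minimal illustration rather than a reference implementation: the sample points and initial means are hypothetical, and the handling of an empty class (it keeps its previous mean) is an assumption, since the steps above do not address that case.

```python
def kmeans(samples, means):
    """Iterate steps 2-4 until no mean changes. Samples and means are tuples of floats."""
    def sqdist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    while True:
        # Step 2: assign every sample to the class of the closest mean.
        classes = [[] for _ in means]
        for s in samples:
            closest = min(range(len(means)), key=lambda i: sqdist(s, means[i]))
            classes[closest].append(s)
        # Step 3: recompute each mean as the average of its class
        # (assumption: an empty class keeps its old mean).
        new_means = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else m
            for members, m in zip(classes, means)
        ]
        # Step 4: stop when no mean changed value.
        if new_means == means:
            return means, classes
        means = new_means

samples = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
means, classes = kmeans(samples, means=[(1.0, 1.0), (2.0, 1.0)])
```

Starting from two poorly placed means, the procedure moves one mean to each pair of nearby samples and then halts because a further pass changes nothing.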

As  in  all  optimization  procedures,  there  are  two  important  components  to 
the algorithm. The first component is the criterion used to measure the quality
of clusters. Most criteria are based on the scatter matrices W and B, representing
the within-cluster scatter and between-cluster scatter, respectively.
The formulas for these matrices are unimportant for the discussion at hand;
they may be found in Duda and Hart [1973, p. 221]. Their basic purpose is
to measure the compactness of each cluster and the inter-cluster separation.
Several criteria which attempt to “minimize” W (in terms of either the trace
or the determinant) and “maximize” B have been proposed. The above algorithm,
which attempts to minimize the squares of the distances between
the data points and their cluster mean, is equivalent to minimizing the trace
of W. The use of these matrices reveals the underlying assumption of these
measures that “good” clusters are those which are tight hyper-spheres in
feature space, separated by distances that are large with respect to their
diameters. Whether such measures are appropriate for a given task depends
upon the validity of the distance metric. Almost all analyses using such scatter
matrices assume a Euclidean metric; as discussed in section 3.2.1 such a
metric may be inappropriate for object categorization.
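
For readers who want the equivalence between the k-means objective and the trace of W spelled out, the following uses the scatter matrices in their commonly stated form. The notation is an assumption made here for illustration (n_i is the number of samples in cluster X_i, μ_i its mean, and μ the mean of all samples) and may differ from that of Duda and Hart.

```latex
W = \sum_{i=1}^{c} \sum_{x \in X_i} (x - \mu_i)(x - \mu_i)^{T},
\qquad
B = \sum_{i=1}^{c} n_i \, (\mu_i - \mu)(\mu_i - \mu)^{T}

\operatorname{tr} W
  = \sum_{i=1}^{c} \sum_{x \in X_i} \operatorname{tr}\!\left[ (x - \mu_i)(x - \mu_i)^{T} \right]
  = \sum_{i=1}^{c} \sum_{x \in X_i} \lVert x - \mu_i \rVert^{2}
```

The last expression is the sum of squared Euclidean distances from each data point to its cluster mean, which is why the trace criterion presupposes a Euclidean metric.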

It is important to note that categories which can be represented as tight
hyper-spheres in feature space begin to satisfy the criteria for a categorization
proposed in chapter 2. If categories exhibit little within-cluster scatter,
then knowledge of an object’s category permits a detailed inference about
that object’s features. Also, object categorization becomes less sensitive to
measurement noise when categories are well separated in feature space; the
inference of an object’s category from observable features becomes more reliable.
However, if the degradation of an object’s description is caused by the
omission of features as opposed to being caused by noisy measurements, then
separated clusters do not ensure reliable categorization. Separation in feature
space is not equivalent to redundancy in feature space.4 As discussed in
chapter 2, categorization for recognition requires being able to determine an
object’s category from only partial information. Thus, while the optimization
criteria used for cluster analysis are related to the proposed goals of
categorization and recognition, they are inadequate for producing a suitable
set of categories.

Given a clustering criterion, the problem of finding the best set of classes
is well defined. Because there is only a finite number of data points n, there
is only a finite number of partitions of the data into c classes; clustering
reduces to finding the best partition. Unfortunately, the number of possible
partitions is on the order of c^n/c! (when n >> c; see Rota [1964]), making
exhaustive search impossible even for a relatively small number of data points.
Therefore, the second component of the optimization procedure is the search
algorithm used to find good clusters.
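
The size of the search space can be computed exactly for small cases. The sketch below uses the standard inclusion-exclusion formula for the number of ways to partition n items into c non-empty classes (the Stirling number of the second kind) and compares it with the approximation c^n/c!; the function name is illustrative.

```python
from math import comb, factorial

def partitions_into_classes(n, c):
    """Number of ways to split n distinct items into c non-empty, unlabeled classes."""
    total = sum((-1) ** j * comb(c, j) * (c - j) ** n for j in range(c + 1))
    return total // factorial(c)

exact = partitions_into_classes(20, 3)     # 580606446
approx = 3 ** 20 / factorial(3)            # the c^n / c! approximation
```

Even for n = 20 and c = 3 there are more than half a billion partitions, and the approximation differs from the exact count by less than one part in a thousand.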

One approach is to use a pruned complete search, a form of branch-and-bound
[Winston, 1977]. Even this method, however, quickly becomes
combinatorially  intractable.  (Hand  [1981]  provides  an  example  with  n  =  20, 
c  =  3,  where  the  pruning  reduced  the  search  by  a  factor  of  1000,  but  still  left 
almost  one  million  partitions  to  be  considered.)  A  more  common  method  of 
search  is  that  of  gradient  descent,  where  objects  are  incrementally  transferred 
from one cluster to another to improve the clustering criterion. In the k-means
method, clusters are modified by transferring each point to the cluster
whose  mean  is  closest  to  that  point.  However,  such  a  method  is  sensitive 
to  the  initial  hypothesis  and,  as  with  all  gradient  descent  algorithms,  may 
terminate  at  a  local  minimum.  One  radical  approach  to  search  is  to  try 
random partitions in an effort to find one of the best m partitions by testing
a set of M partitions [Fortier and Solomon, 1966]. Simple probability theory
can  determine  how  large  M  must  be  in  order  to  be  likely  to  discover  one  of 
the  m  best  partitions.  The  difficulty  with  this  approach  is  that  M  grows  too 
quickly  with  n  for  some  fixed  probability  of  success.  In  chapter  5  we  will 
develop  a  similar  strategy  for  recovering  classes  of  objects,  but  will  apply  the 
random search to only small subsets of the n samples. By restricting the
random  search  to  small  sets,  we  can  maintain  a  high  probability  of  success 
without  testing  arbitrarily  large  numbers  of  partitions. 
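
The "simple probability theory" can be sketched as follows, under the assumption (made here for illustration, not taken from Fortier and Solomon) that the M partitions are drawn independently and uniformly at random from all N possible partitions. The probability of drawing at least one of the m best is then 1 - (1 - m/N)^M, and the smallest adequate M follows by solving for it.

```python
import math

def partitions_to_test(total, m, p):
    """Smallest M with 1 - (1 - m/total)**M >= p, assuming independent uniform draws."""
    # log1p keeps precision when m/total is tiny.
    return math.ceil(math.log1p(-p) / math.log1p(-m / total))

small_case = partitions_to_test(total=100, m=10, p=0.9)

# For n = 20 samples and c = 3 classes there are 580606446 partitions.
large_case = partitions_to_test(total=580606446, m=1000, p=0.95)
```

For small m/N the required M is roughly proportional to N/m, so it inherits the rapid growth of the number of partitions with n.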


4 In chapter 4 we will provide a formal definition for redundancy. For now, let us assume
redundancy measures how easily one can categorize an object given only a partial
description of that object.
As a final comment we note that the use of optimization methods usually
requires the a priori knowledge of the number of clusters present; often, such
a priori information is not available. One solution to this problem is to
augment the search procedure with the ability to split or merge categories
at “appropriate” times; such a capability also allows optimization methods
to cope with the addition of new data points. The well known ISODATA
program of Ball and Hall [1967] provides such a mechanism in a clustering
program which uses the trace of W presented above as the optimization
criterion. If the sum of the squared distances between the data points and
the mean of any cluster becomes greater than some user-specified threshold,
then the cluster is split into two clusters. Likewise, pairs of clusters are
merged if their means are separated by a distance less than some other
user-specified threshold. This method is successful for limited domains where such
thresholds can be specified. However, a more consistent approach to cluster
splitting and merging is to do so whenever such a change will improve the
criterion measure. Such a procedure is possible only if the clustering criterion
is not biased towards having many or few clusters. For example, the sum
of squared distances is always reduced by splitting a cluster and thus would
bias the procedure to find many (in fact n) clusters. Because the measure
of the quality of a categorization we will develop in chapter 4 is not biased
to having many or few clusters, we will be able to split and merge clusters
according to the improvement of the clustering measure.
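
The bias of the sum of squared distances toward more clusters is easy to demonstrate. The sketch below uses hypothetical one-dimensional samples that form a single tight group; even so, splitting the group lowers the criterion.

```python
def sse(cluster):
    """Sum of squared distances from the 1-D points in `cluster` to their mean."""
    mean = sum(cluster) / len(cluster)
    return sum((x - mean) ** 2 for x in cluster)

def best_split(cluster):
    """Best two-way split of sorted 1-D points, returned as (cost, left, right)."""
    pts = sorted(cluster)
    return min((sse(pts[:k]) + sse(pts[k:]), pts[:k], pts[k:]) for k in range(1, len(pts)))

data = [1.0, 1.2, 0.9, 1.1]          # hypothetical samples from one tight group
whole = sse(data)
split_cost, left, right = best_split(data)
```

A procedure that splits whenever the criterion improves would therefore keep splitting until every sample is its own cluster, at which point the criterion is zero.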

3.2.4  Cluster  validity 

An alternative to adding a cluster formation and deletion ability to optimization
methods is to simply execute the same optimization procedure for
a  range  of  c,  and  then  to  compare  the  results.  However,  to  select  one  c 
over  another  requires  being  able  to  assess  the  validity  of  a  clustering  of  some 
data.  Likewise,  when  generating  a  taxonomy  with  a  hierarchical  method, 
one is guaranteed that there exists a clustering of the data into c classes for
all c, 1 ≤ c ≤ n. To determine which of these descriptions represents “structure”
in the data requires some method of determining whether a particular
clustering  is  an  arbitrary  grouping  of  the  data  imposed  by  the  algorithm,  or 
a  grouping  robustly  determined  by  the  data  themselves.  Unfortunately,  few 
methods  for  answering  this  question  exist,  and  most  of  these  are  weak. 

One formal method of assessing cluster validity is based on statistical
assumptions about the distribution of the data. As an example, consider an
optimization method which seeks to minimize the sum of the squared within-cluster
distances. Because c + 1 clusters will always fit the data better than
c clusters, we cannot use the absolute measure to determine which clustering
is to be preferred. However, suppose we assume that the underlying data
are sampled from c normally distributed classes. Then, one can derive an
expected value for how much the clustering criterion would improve by using
c + 1 clusters instead of c. (For details of such a derivation see Duda and
Hart [1973].) By comparing this value to the actual improvement obtained
by splitting the c clusters into c + 1 clusters, one can determine the validity
of the new cluster. This method is only applicable when some underlying
distribution of the data can be assumed and thus has limited applicability
to domains where one is attempting to discover the structure of data. Because
cluster analysis is usually used as a tool for such discovery, statistical
measures of validity are highly suspect.

A  simpler,  intuitive  approach  to  the  validity  problem  may  be  referred  to 
as  “leave  some  out”  methods  [Hartigan,  1975].  In  these  methods  either  some 
of  the  data  points  or  some  of  the  dimensions  used  to  describe  the  samples 
are  omitted  while  executing  the  clustering  procedure.  After  a  set  of  clusters 
is  generated  (or,  in  hierarchical  methods,  selected  from  the  taxonomy)  the 
additional data points or dimensions are checked for consistency. A standard,
though weak, method of checking is to test the statistical distribution
of the additional sample points or dimensions. For example, the distribution
of values along a previously omitted dimension would be checked for statistically
significant differences between clusters. If such a difference is found,
then  the  belief  that  the  discovered  clusters  reflect  structure  in  the  data  is 
strengthened.  The  weakness  of  this  method  is  that  not  finding  a  significant 
difference  only  determines  that  some  particular  dimension  is  not  constrained 
within  the  clusters  discovered.  Assuming  a  sufficient  quantity  of  constrained 
dimensions, one would test other omitted dimensions, hoping to find constrained
dimensions that supported the clustering discovered. Because of
the qualitative nature of these methods, little formal analysis is possible.
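
A check of this kind on an omitted dimension might look like the following sketch. The choice of Welch's t statistic is an assumption made here for illustration (the text does not name a test), and the values along the omitted dimensions are hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for the difference between two sample means."""
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))

# Hypothetical values of two omitted dimensions, grouped by the clusters already found.
constrained = {"cluster 1": [5.1, 4.9, 5.0, 5.2], "cluster 2": [7.0, 7.1, 6.9, 7.2]}
unconstrained = {"cluster 1": [3.0, 8.0, 5.0, 6.0], "cluster 2": [4.0, 7.0, 5.5, 6.5]}

t_constrained = welch_t(constrained["cluster 1"], constrained["cluster 2"])
t_unconstrained = welch_t(unconstrained["cluster 1"], unconstrained["cluster 2"])
```

A large magnitude of the statistic on an omitted dimension supports the clustering; a small one, as argued above, shows only that this particular dimension is not constrained within the clusters.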

The problem of whether an unconstrained dimension disconfirms the belief
that a particular clustering is valid brings to light a fundamental shortcoming
of cluster analysis: there is no a priori criterion for success. Let us
assume that we have arbitrarily fast computing machinery and that we select
the optimal partition of some data according to a particular clustering
criterion. It makes no sense to ask whether these clusters are “valid” classes,
because by definition the groups which minimize the metric are the right
groups. Therefore, if one wants to be able to say that the recovered classes
are “valid” for some task, then it must be the case that the criterion used
directly measures validity of the classes for the task at hand. In chapters
1 and 2 we defined the categorization task to be that of creating a set of
categories that permitted robust categorization and the reliable inference of
an object’s properties from its category. Thus, if we are to create such a
set of categories, we will need to create a metric which directly measures these
aspects of a categorization.

3.2.5  Clustering  vs.  classification 

It  is  important  to  note  the  difference  between  cluster  analysis  as  described 
above and pattern classification [Hand, 1981]. The term “classification” usually
refers to the problem of deciding to which of a known set of categories a
novel object belongs. Most of the pattern recognition and classification literature
does not address the problem of discovering categories in a population
of  objects.  It  is  assumed  that  a  data  analyst  will  provide  a  representative  set 
of  known  instances;  the  problem  of  classification  is  to  determine  a  measure 
or  procedure  by  which  new  objects  can  be  correctly  classified. 

However,  one  aspect  of  classification  theory  does  relate  to  the  problem  of 
discovering structure within data. Often, the goal of a classification program
is to build a decision tree that provides an algorithmic decision sequence that
will correctly classify new objects [Quinlan, 1986; Breiman et al., 1984].
In constructing such trees, a trade-off exists between the mis-classification
rate and the total complexity of the decision function, often measured by
the number of nodes in the decision tree. Breiman et al. [1984] suggest a
pruning mechanism that combines the two criteria using a free parameter
α. This combination of opposing criteria is similar to that proposed by
Tversky [1977] for determining basic level categories and is thus subject to
the same criticism: the complexity of the description (for Breiman et al.
the number of nodes, for Tversky the number of categories) should not
be  confused  with  the  utility  of  a  set  of  categories.  However,  the  principle 
of  trading  ease  of  category  inference  for  a  more  powerful  set  of  categories  is 
important  and  will  be  central  to  the  theory  developed  in  this  thesis. 
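
The combination of the two criteria through a free parameter is usually written as a single cost to be minimized over pruned trees. The form below is the commonly stated one, given here as an illustration; R(T) denotes the mis-classification rate of tree T and |T| its complexity, which in the usual presentation counts terminal nodes rather than all nodes.

```latex
R_{\alpha}(T) = R(T) + \alpha \, |T|, \qquad \alpha \ge 0
```

With α = 0 the largest tree is preferred, and as α increases progressively smaller trees minimize the cost, so α plays the role of the observer-dependent trade-off discussed above.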
3.2.6  Summary  of  cluster  analysis

Let us summarize the aspects of cluster analysis relevant to the task of object
categorization. Three serious deficiencies in cluster analysis techniques
were identified. First, the use of a distance metric which requires constraint
in all dimensions and is applied uniformly throughout feature space is inappropriate
for natural object categorization. Different processes in the world
constrain different properties of objects and one must expect that each class
of objects will have unconstrained dimensions. Second, methods of category
formation that require the entire set of data be available initially (such as
hierarchical methods) are not applicable in the natural world where new objects
are often encountered. Third, optimization criteria are only useful if
they directly measure the utility of the categories for a particular task. Criteria
based only on distance in feature space cannot guarantee the formation
of categories which permit the inferences required for the recognition task.

However,  two  positive  aspects  of  cluster  analysis  were  also  noted.  First, 
the dendrogram formed by hierarchical methods provides a method for testing
the stability of a clustering with respect to the distance between clusters.
We argued that it might be possible to test the validity of a categorization
if the “distance” axis was sensitive to different processes involved in the creation
of the data points (objects). Second, the tight hyper-sphere categories
preferred by the criteria based upon scatter-matrices begin to satisfy the
goals of a categorization established in chapter 2: the reliable inference of
an object’s properties from its category, and the reliable inference of an object’s
category from its properties. Better categories can be chosen only if
the  clustering  criteria  directly  measure  how  well  the  categories  support  these 
goals. 

3.3  Machine  learning 

The last field of research we must consider is that of machine learning. Machine
learning is concerned with the issues involved in constructing a machine
(program) that can discover structure in the world by examining specific instances.
Whether the problem is to “discover” the laws of thermodynamics
by “observing” experiments [Langley, Bradshaw, and Simon, 1983] or to
learn the rules of integration by being shown examples [Mitchell, 1983], the
basic learning step requires induction: the formation of a general conclusion
based on evidence from particular examples. Unlike deduction programs
which derive conclusions known to be true, induction programs are constructed
such that the conclusions they derive are likely to be true.

For  example,  the  BACON  program  [Langley,  Bradshaw,  and  Simon,  1983] 
is able to discover scientific laws such as F = ma, V = IR, and PV = nRT.
The  reason  BACON  is  successful  in  these  cases  is  that  the  program  explicitly 
seeks  relations  formed  by  simple  additive  and  multiplicative  manipulations. 
That  is,  embedded  within  the  program  is  the  belief  (on  the  part  of  the 
programmer)  that  if  relations  of  this  form  adequately  describe  the  data,  then 
these  relations  are  the  correct  natural  laws.  Furthermore,  there  is  the  belief 
that  laws  of  this  form  exist,  thereby  justifying  a  search  for  these  relations. 

3.3.1  Conceptual  clustering 

One  focus  of  machine  learning  which  is  relevant  to  the  task  of  categorization 
is  in  the  area  of  conceptual  clustering  [Michalski,  1980;  Michalski  and  Stepp, 
1983a, b],  a  paradigm  similar  to  the  cluster  analysis  methodology  presented 
above.  As  in  cluster  analysis,  the  task  at  hand  is  to  categorize  a  set  of  data 
points  into  “good”  classes.  However,  in  conceptual  clustering  the  notion 
of  “good”  is  not  (solely)  based  upon  a  distance  metric,  but  also  on  an  a 
priori  belief  as  to  what  types  of  cluster  descriptions  are  “natural.”  Similar 
to  the  discovery  program  BACON  which  makes  an  assumption  about  the 
form  of  a  natural  law,  conceptual  clustering  programs  make  an  assumption 
about  the  form  that  descriptions  of  natural  clusters  should  have.  We  shall 
need  to  relate  the  particular  beliefs  about  the  desired  form  for  descriptions 
of  natural  classes  to  the  goals  of  categorization  and  the  principle  of  natural 
modes  presented  in  chapter  2. 

As  an  example  of  conceptual  clustering,  let  us  consider  the  work  of 
Michalski  and  Stepp  [1983a, b].  In  their  system  —  CLUSTER/2  —  data 
points are represented as feature vectors, but the dimensions are not
necessarily ordinal. Typical features would be “color” or “shape” which
could take values such as “red” or “round,” respectively. Convex subsets5
of data points are described by conjunctive combinations of internally

5 Convex is not exactly the correct description since the nominal features (e.g. “color”) are
not metric. However, if they were, and they were arranged (just for this conjunction)
such that the internal disjunctions (e.g. red ∨ blue) were sequential, then the sets would
be convex.

disjunctive feature selectors; these combinations are referred to as conjunctive
complexes. For example,

[shape = round] [color = red ∨ blue] [size ≥ medium]

would  describe  the  set  of  all  red  or  blue  round  things  that  are  at  least  size 
medium.  Arbitrary  clusters  can  then  be  represented  by  the  union  of  such 
conjunctive  complexes;  these  unions,  which  are  made  as  simple  as  possible  by 
eliminating any redundant complexes, are referred to as stars. When a set of
clusters  is  finally  chosen,  the  stars  may  be  used  as  conceptual  descriptions  of 
the  clusters.  Unlike  standard  cluster  analysis,  conceptual  clustering  produces 
a  description  —  claimed  to  be  conceptual  —  of  the  recovered  classes. 
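
The representation can be made concrete with a small sketch. This is not the data structure of CLUSTER/2; it is an illustrative encoding in which each selector is a set of allowed values, and the ordering assumed for the “size” feature and all names are hypothetical.

```python
SIZES = ["small", "medium", "large"]    # assumed ordering of the "size" feature

# One conjunctive complex: every selector must hold; each selector lists its allowed values.
round_red_or_blue = {
    "shape": {"round"},
    "color": {"red", "blue"},                         # internal disjunction
    "size": set(SIZES[SIZES.index("medium"):]),       # at least size medium
}

def satisfies(obj, cx):
    """True if the object meets every selector of the conjunctive complex."""
    return all(obj[feature] in allowed for feature, allowed in cx.items())

def in_union(obj, complexes):
    """A cluster described by a union of complexes contains obj if any one complex holds."""
    return any(satisfies(obj, cx) for cx in complexes)

ball = {"shape": "round", "color": "blue", "size": "large"}
pea = {"shape": "round", "color": "green", "size": "small"}
```

Here the object `ball` satisfies the complex and `pea` does not; a cluster described by several complexes would be tested with `in_union`.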

Michalski  and  Stepp  describe  a  procedure  for  clustering  similar  to  the 
k-means method of cluster analysis described in the previous section. An
initial set of c “seed” data points is chosen and “good” clusters described
by  the  unions  of  conjunctive  complexes  are  built  around  those  seeds.  Then, 
an  iteration  loop  is  executed  in  an  attempt  to  select  new  seeds  that  yield 
better  clusters.  The  details  of  how  seeds  are  selected,  and  of  how  clusters 
are  constructed  are  not  important  for  relating  this  work  to  the  problem  of 
categorizing  objects  for  recognition.  Of  interest  are  the  criteria  used  to  judge 
the  quality  of  a  clustering,  and  how  those  criteria  relate  to  the  proposed  goals 
of  categorization  and  the  principle  of  natural  modes. 

Michalski  and  Stepp  describe  four  component  criteria  relevant  to  the 
present  discussion.  Each  represents  a  different,  intuitively  desirable  prop¬ 
erty  for  “good”  clusters.  The  first  two  —  commonality  and  disjointness  — 
resemble  the  scatter  matrices  of  cluster  analysis.  Commonality  refers  to  the 
number  of  properties  shared  by  data  points  within  a  cluster;  if  sharing  of 
properties  is  used  to  define  a  distance  metric,  then  commonality  resembles 
the  inverse  of  the  within-cluster  scatter.  Likewise,  disjointness  measures 
the  degree  of  separation  —  how  little  they  overlap  —  between  each  pair 
of  complexes  taken  from  different  stars;  this  measure  is  analogous  to  the 
between-cluster  scatter.  As  previously  mentioned,  clustering  criteria  based 
upon  these  scatter  matrices  favor  categories  that  are  tight  hyper-spheres  in 
feature  space.  Also,  as  discussed,  such  categories  begin  to  satisfy  the  criteria 
of  categorization  proposed  in  chapter  2. 

The next component of the clustering criteria reflects an assumption
about the goal of categorization. Discriminability6 measures the degree of
ease in determining the cluster to which an object belongs given only a
partial description of the object. As a clustering becomes more discriminable,
less information is required (on average) to identify an object’s category.
This criterion corresponds to one of the goals of categorization outlined in
chapter 2: reliable categorization when provided with partial information.

The  final  element  of  the  clustering  criteria  of  Michalski  and  Stepp  is 
simplicity,  and  it  is  an  assumption  about  what  constitutes  a  “meaningful” 
category  in  the  world.  Simplicity  is  defined  as  the  negative  of  the  complexity, 
which  is  simply  the  total  number  of  feature  selectors  in  all  the  cluster  de¬ 
scriptions.  This  criterion  reflects  the  assumption  that  the  most  meaningful 
categories  are  those  that  can  be  described  by  a  small  number  of  properties. 
Let us consider the validity of the simplicity criterion in light of the
principle of natural modes. In one respect, simplicity is consistent with a modal
world:  if  natural  classes  are  highly  clustered  in  the  space  of  important  en¬ 
vironmental  properties,  then  only  a  small  number  of  these  properties  need 
be  described  to  classify  an  object.  However,  when  posed  in  this  manner, 
this  criterion  is  equivalent  to  the  discriminability  criterion  above.  The  more 
fundamental  meaning  of  simplicity  is  that  the  clusters  are  defined  by  a  small 
number  of  properties;  this  is  the  view  of  simplicity  intended  by  Michalski 
and  Stepp,  as  they  refer  to  the  conceptual  description  of  the  clusters  as  the 
“meaning”  of  the  classes.  In  this  light,  simplicity  is  at  odds  with  the  principle 
of  natural  modes,  which  posits  the  existence  of  highly  structured,  complex 
classes.  These  categories  are  discriminable  because  their  complex  structures 
are  highly  dissimilar;  complex  environmental  pressures  cause  objects’  config¬ 
urations  to  be  different  from  one  another  in  a  large  number  of  dimensions. 
Thus,  simplicity  —  an  intuitively  appealing  criterion  —  cannot  be  regarded 
as  consistent  with  the  goal  of  categorizing  objects  according  to  their  natural 
modes. 
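
The simplicity criterion as defined above amounts to a count of selectors. A
minimal sketch, assuming the same hypothetical encoding of a cluster
description as a list of complexes (each a mapping from feature name to allowed
values), shows why a richly described class is penalized:

```python
from typing import Dict, FrozenSet, List

Complex = Dict[str, FrozenSet[str]]   # feature name -> allowed values
Star = List[Complex]                  # union of conjunctive complexes


def complexity(clustering: List[Star]) -> int:
    """Total number of feature selectors over all cluster descriptions."""
    return sum(len(cpx) for star in clustering for cpx in star)


def simplicity(clustering: List[Star]) -> int:
    """Simplicity is the negative of the complexity."""
    return -complexity(clustering)


# Two made-up clusterings: one described by a single feature per cluster,
# one described by three features per cluster.
terse = [
    [{"color": frozenset({"red"})}],
    [{"color": frozenset({"blue"})}],
]
detailed = [
    [{"color": frozenset({"red"}), "shape": frozenset({"round"}),
      "size": frozenset({"large"})}],
    [{"color": frozenset({"blue"}), "shape": frozenset({"square"}),
      "size": frozenset({"small"})}],
]
```

Under this count `terse` scores higher than `detailed`, even though the
detailed description is the one that records classes differing along many
dimensions at once.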

of  their  clustering  procedure  [Michalski  and  Stepp,  1983a, b].  The  discrimination  index  is 
defined  to  be  the  number  of  dimensions  that  singly  distinguish  all  of  the  clusters  —  they 
take  on  a  different  value  for  each  cluster.  Dimensionality  reduction  is  defined  to  be  the 
negative  of  the  number  of  dimensions  required  to  uniquely  identify  the  cluster  to  which 
an  object  belongs;  the  negative  value  is  used  so  that  the  value  increases  as  a  clustering 
becomes  more  discriminable.  If  the  discrimination  index  is  greater  than  zero  (at  least 
one  dimension  singly  distinguishes  all  of  the  clusters),  then  the  dimensionality  reduction 
must be −1. We can define discriminability to be the sum of these two values: the greater
the  value,  the  less  restricted  is  the  information  that  will  uniquely  determine  an  object’s 
category. 


In summary, conceptual clustering represents an improvement over standard
cluster analysis. Besides the advertised extension of providing a
description of the created clusters, conceptual clustering utilizes criteria that
consider the goals of the observer (discriminability improves reliability)
and an a priori belief about the structure of natural classes (simple classes
are preferred). However, we have argued that the assumption that simple
descriptions  are  the  right  descriptions  is  not  valid  for  the  task  of  catego¬ 
rizing  objects  in  the  natural  world;  natural  objects  are  highly  constrained 
and  thus  complex  in  structure.  Furthermore,  conceptual  clustering  faces  the 
same  category  validity  problem  as  cluster  analysis.  The  categories  recovered 
are  those  which  optimize  the  particular  set  of  criteria  chosen;  the  criteria 
were  not  chosen  according  to  some  task  requirement.  Thus  it  is  difficult  to 
assess  the  utility  of  the  recovered  classes  for  a  specified  task  such  as  that 
proposed  in  chapter  2:  the  reliable  inference  of  an  object’s  properties  from 
its  category. 

We  have  not  presented  the  method  used  by  Michalski  and  Stepp  to  find 
possible  clusters  (they  use  a  form  of  a  bounded  best-first  search)  as  it  re¬ 
sembles  search  techniques  used  in  standard  cluster  analysis.  The  procedure 
is  iterative  and  not  well  suited  to  a  system  which  must  dynamically  gener¬ 
ate  categories  as  new  data  are  observed.  Also,  the  computational  expense  of 
forming  these  good,  but  certainly  not  optimal,  clusters  is  almost  prohibitive.7 

3.3.2  Explanation-based  learning 

We stated that the criterion of simplicity used by Michalski and Stepp [1983a, b]
reflected  an  assumption  about  the  structure  of  categories  in  the  world.  As 
with  all  similarity-based  methods,  the  vocabulary  on  which  the  syntactic 
operations  are  performed  (operations  such  as  measuring  the  complexity  of  a 
cluster  by  counting  the  number  of  feature  selectors  used)  implicitly  embodies 
a  theory  about  the  world.  As  demonstrated  in  the  proof  of  the  Ugly  Duck¬ 
ling  Theorem  in  chapter  2,  a  different  set  of  predicates  can  cause  previously 
“simple” categories to become “complex.” Unfortunately, the theory embedded
in similarity-based techniques always remains implicit in the vocabulary.
Thus it is difficult (if not impossible) to improve one’s theory through
experience, and it is difficult to evaluate the correctness of a theory except by
the actual execution of the similarity-based algorithm.

Recently, a new form of machine learning, referred to as explanation-based,
has been developed in an attempt to incorporate an explicit theory
about a domain into the learning process. For example, the LEX program
of  Mitchell  [1983]  uses  a  priori  information  about  mathematical  relations  to 
learn  the  rules  of  symbolic  integration  from  examples  provided  by  a  teacher. 
Instead  of  just  using  syntactic  rules  for  comparing  one  formula  to  another, 
the  program  uses  its  knowledge  about  mathematical  functions  to  form  its 
generalizations.  For  example,  part  of  its  theory  includes  the  fact  that  sin 
and  cos  are  both  trigonometric  functions.  Therefore,  when  it  is  told  that  the 
integration of x sin x can be accomplished by integration-by-parts, the
program hypothesizes a generalized rule that states x trig x can be integrated
by “integration-by-parts.” This rule is maintained unless a counterexample
is provided by the teacher.
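
The generalization step just described can be caricatured in a few lines of
Python. This is a toy illustration of the idea only, not a reconstruction of
LEX: the `PARENT` table, the rule encoding, and all function names are
hypothetical, and only the fact that sin and cos are both trigonometric
functions comes from the text.

```python
from typing import Dict, Tuple

# Hypothetical fragment of a domain theory: each function symbol maps to its
# parent concept. Only the sin/cos -> trig link comes from the text.
PARENT: Dict[str, str] = {"sin": "trig", "cos": "trig"}

Rule = Tuple[str, str]  # (function pattern in the integrand, method)


def generalize(example_fn: str, method: str) -> Rule:
    """Climb one step in the concept hierarchy, if the theory allows it."""
    return (PARENT.get(example_fn, example_fn), method)


def applies(rule: Rule, fn: str) -> bool:
    """A rule covers a function if it names it or names its parent concept."""
    pattern, _ = rule
    return fn == pattern or PARENT.get(fn) == pattern


def retract(rule: Rule, counterexample_fn: str, original_fn: str) -> Rule:
    """Fall back to the specific rule if the teacher's counterexample is covered."""
    return (original_fn, rule[1]) if applies(rule, counterexample_fn) else rule


# Told that x sin x is integrated by parts, hypothesize the rule for x trig x.
rule = generalize("sin", "integration-by-parts")
```

The generalized rule covers cos as well as sin, and it is kept until a covered
counterexample forces a return to the specific case.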

Certainly, an explanation-based approach to categorization would be
a more powerful technique than simple similarity-based methods [DeJong,
1986];8 at present we are unaware of any such attempts. Such an approach
would require an underlying theory of the physics of natural objects. The
program would have to know what types of equivalence classes can be created
by different object processes. Evidence for such a strategy existing in
organisms may be found in the work of Cerella [1979], in which pigeons were able
to form a natural category for “white-oak-leaf” from the presentation of just
one instance. The pigeons must have an underlying theory that determines
which  aspects  of  the  physical  structure  of  the  leaf  are  likely  to  be  important 
in  determining  its  natural  class.  In  chapter  7  we  will  consider  some  possi¬ 
ble  extensions  to  the  work  presented  in  this  thesis;  the  most  interesting  of 
these  incorporates  knowledge  of  physical  processes  into  the  mechanism  for 
recovering  natural  object  categories. 


Chapter 4

Evaluation of Natural Categories


In  chapter  2  we  argued  that  the  goal  of  the  observer  is  to  form  object  cat¬ 
egories  that  permit  the  reliable  inference  of  unobserved  properties  from  ob¬ 
served  properties;  we  claimed  that  to  achieve  this  goal  the  observer  should 
categorize  objects  according  to  their  natural  modes.  To  accomplish  this  task, 
the  observer  must  be  provided  with  two  separate  capabilities.  First,  he  must 
be  able  to  identify  when  a  set  of  categories  corresponds  to  a  set  of  natural 
clusters.  This  ability  requires  that  the  observer  be  given  criteria  with  which 
to  evaluate  a  particular  categorization.  The  second  capability  required  is 
that  of  being  able  to  make  “good  guesses.”  Chapter  3  included  a  section 
on  the  search  strategies  used  by  optimization  methods  of  cluster  analysis; 
such  a  search  strategy  is  necessary  because  of  the  enormous  number  of  pos¬ 
sible  partitionings  of  a  set  of  objects.  Likewise,  to  discover  “the  correct  set” 
of  categories,  the  observer  must  consider  that  particular  set  as  a  possible 
candidate.  In  this  chapter  we  develop  a  measure  of  the  extent  to  which  a 
categorization  allows  the  observer  to  make  inferences  about  the  properties 
of  objects.  We  defer  the  problem  of  generating  suitable  hypotheses  until  the 
following  chapter. 

We  proceed  by  first  considering  only  the  goals  of  the  observer,  and  de¬ 
riving  an  evaluation  function  which  measures  how  well  a  particular  catego¬ 
rization  of  objects  supports  these  goals.  We  then  describe  the  behavior  of 
this  measure  in  both  a  structured  (natural  modes)  and  unstructured  world. 
Finally, by means of an example drawn from the natural domain of leaves,
we demonstrate the ability of the measure to distinguish between natural
and arbitrary categorizations.


4.1  Objects,  Classes,  and  Categories 

First let us define some necessary terminology. We assume there exists a
fixed set of objects, $\{\theta_1, \theta_2, \ldots, \theta_n\}$; $\Theta$
denotes the set of all possible objects.
As mentioned in chapter 1, we will not provide a definition for “object,”
though at the conclusion of the thesis we will consider using the construct of
a category to define criteria for being an object. A categorization, Z, is
simply a partitioning of this set of objects, with each equivalence class defined
by the partition being referred to as a category. Notice that in this
terminology (and for the remainder of this thesis) categories and categorizations
are mental constructs, hypotheses and conclusions made by the observer. In
section 4.3.1 we will develop a formal notation for deriving expressions
involving categories and categorizations. The goal of the observer is to create
a categorization of objects that supports the goals of inference established in
chapter 2.

When  we  need  to  refer  to  the  structure  of  objects  in  the  world,  we  will 
refer  to  object  classes.  Thus  the  principle  of  natural  modes  states  that  ob¬ 
jects  in  the  world  are  divided  into  natural  classes;  these  classes  are  produced 
by  the  natural  object  processes  discussed  in  section  2.4.  Because  the  discus¬ 
sion  of  this  chapter  will  focus  on  the  evaluation  of  the  observer’s  proposed 
categorizations,  we  will  not  provide  a  more  extensive  definition  of  classes;  for 
a  more  formal  discussion  about  classes  see  Bobick  and  Richards  [1986]. 


4.2  Levels  of  Categorization 

We  begin  our  development  of  a  measure  of  how  well  a  categorization  supports 
the  goals  of  the  observer  by  considering  object  taxonomies,  such  as  that 
pictured  in  Figure  4.1.  (As  is  often  the  case,  trees  in  computers  grow  upside- 
down:  the  root  node  THING  is  at  the  top;  the  leaves,  e.g.  “Fido”,  at  the 
bottom.)  Each  non-terminal  node  represents  a  category  composed  of  the 
union of the categories below it. The terminal nodes (the “leaves” of
the tree) are categories containing exactly one object. Given a set of
objects, one may create a large number of taxonomies. For the purposes of
developing a measure of the utility of a categorization we will assume that
some particular taxonomy has been provided.

Notice  that  the  set  of  categories  at  any  level  of  the  taxonomy  constitutes 
a  partitioning  of  the  set  of  objects  and  is  thus  a  legitimate  categorization. 
Suppose  our  task  is  to  select  the  level  which  best  allows  the  observer  to  satisfy 
his  goals  of  making  reliable  predictions  about  unobserved  (and  observed) 
properties.1  Let  us  assume  that  the  observer  will  make  these  predictions 
based  upon  the  category  to  which  he  assigns  some  object  and  the  properties 
of  other  objects  known  to  be  of  that  category.  Therefore,  to  select  the  best 
level  of  the  taxonomy,  we  have  to  consider  how  the  depth  of  the  categorization 
affects  the  ability  of  the  observer  to  correctly  categorize  an  object  and  the 
ability  to  make  predictions  about  an  object  once  its  category  is  known. 
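
The observation that every level of a taxonomy is a legitimate categorization
can be checked mechanically. The sketch below uses a small made-up tree as a
stand-in for Figure 4.1; the nested-dictionary encoding, the object names other
than “Fido”, and the function names are all assumptions of this example.

```python
from typing import Dict, List, Set, Union

# A taxonomy node is either a single object (a string) or a dict mapping child
# names to subtrees. The tree below is invented for illustration.
Tree = Union[str, Dict[str, "Tree"]]

TAXONOMY: Tree = {
    "animal": {
        "dog": {"husky": "Fido", "shepherd": "Rex"},
        "horse": {"arabian": "Star", "pony": "Dot"},
    },
    "plant": {
        "tree": {"oak": "Oak1", "maple": "Maple1"},
        "flower": {"rose": "Rose1", "tulip": "Tulip1"},
    },
}


def objects_under(node: Tree) -> Set[str]:
    """All objects (leaves) below a node."""
    if isinstance(node, str):
        return {node}
    return set().union(*(objects_under(child) for child in node.values()))


def categorization_at(node: Tree, depth: int) -> List[Set[str]]:
    """Categories obtained by cutting the taxonomy at the given depth."""
    if depth == 0 or isinstance(node, str):
        return [objects_under(node)]
    return [cat for child in node.values()
            for cat in categorization_at(child, depth - 1)]


def is_partition(categories: List[Set[str]], universe: Set[str]) -> bool:
    """Categories must be non-empty, pairwise disjoint, and cover the universe."""
    total = sum(len(c) for c in categories)
    covered = set().union(*categories)
    return all(categories) and total == len(universe) and covered == universe
```

Depth 0 yields the single category THING, and each further step down doubles
the number of categories in this balanced example while preserving the
partition property.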

For  the  remainder  of  this  chapter,  we  will  be  considering  only  observed 
properties,  since  we  assume  that  those  properties  are  the  only  ones  avail¬ 
able  to  the  observer  for  evaluation  of  a  categorization.  As  discussed  in  sec¬ 
tion  2.4.2  the  unobserved  properties  should  behave  similarly  to  the  observed 
properties.  Therefore,  we  assume  that  a  categorization  that  provides  good 
performance  in  terms  of  predicting  observed  properties,  and  that  allows  reli¬ 
able  categorization  based  on  those  properties,  will  also  be  good  for  predicting 
unobserved  properties. 

4.2.1  Minimizing  property  uncertainty 

First,  consider  moving  down  the  tree  from  the  root  towards  the  leaves,  mov¬ 
ing  from  THING  to  “Fido”  (Figure  4.2).  In  doing  so,  the  categories  become 
more specialized: knowledge that an object belongs to the category provides
more information about the object. For example, knowing that some object
is a dog allows the observer to predict many more properties (e.g. has
teeth, has hair, has legs) than if he only knew the object was an animal. At
the  extreme  depth  of  categorization,  each  category  contains  only  one  object. 
Let  us  assume  the  observer  knows  everything  there  is  to  know  about  each 

1 We should point out that it is somewhat artificial to require selecting some particular
level. This presupposes that the same level is the best level across the entire taxonomy
tree. A more appropriate task would be to pick the best set of spanning nodes, since
the “best” level in one part of the tree may be lower than that of another part. For
the principles to be developed here, however, considering a fixed level of categorization is
sufficient.

Figure 4.1: A complete taxonomy of objects. The question to be considered
is what is the appropriate level of categorization given the goals of recognition.
It should be noted that the number of possible taxonomies is enormous. For
example, if there are 128 objects, then the number of balanced binary taxonomies is
approximately  10s5. 
instance (e.g. “Fido”). Then, at this finest level of categorization, knowledge
of  the  category  to  which  an  object  belongs  allows  complete  prediction  of  its 
observed  and  unobserved  properties. 

We can describe the process of increasing the depth of categorization as
minimizing property uncertainty, which we denote as $U_p(Z)$. This
uncertainty decreases as objects in a category become more “similar” to each
other. Thus property uncertainty measures the inhomogeneity of each
category and expresses the difficulty the observer has in attaining his goal of
being able to predict properties of an object once its category is known.
Later, we will propose an explicit measure for $U_p$ which is claimed to be
appropriate for perception. For now we simply note that $U_p$ decreases as we
move down the taxonomy hierarchy.
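
The explicit measure is deferred to a later section, so the following is only
a stand-in that makes the qualitative claim concrete. It assumes property
uncertainty can be approximated by the Shannon entropy of feature values within
each category, with features treated independently and categories weighted by
size; the data and all names are invented for the sketch.

```python
import math
from collections import Counter
from typing import Dict, List

Obj = Dict[str, str]  # feature name -> value


def entropy(counts: Counter) -> float:
    """Shannon entropy (bits) of an empirical distribution."""
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def property_uncertainty(categorization: List[List[Obj]]) -> float:
    """Size-weighted average over categories of the summed per-feature entropies."""
    n = sum(len(cat) for cat in categorization)
    total = 0.0
    for cat in categorization:
        features = cat[0].keys()
        h = sum(entropy(Counter(o[f] for o in cat)) for f in features)
        total += (len(cat) / n) * h
    return total


dogs = [{"legs": "4", "coat": "thick"}, {"legs": "4", "coat": "short"}]
birds = [{"legs": "2", "coat": "feathers"}, {"legs": "2", "coat": "feathers"}]

coarse = [dogs + birds]                 # one category: THING
middle = [dogs, birds]                  # one category per kind
finest = [[o] for o in dogs + birds]    # one category per object
```

With these four objects the stand-in measure falls from the single-category
level to the per-kind level and reaches zero when every category holds exactly
one object.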

There  is,  however,  a  price  to  be  paid  for  increased  categorization  depth 
and the reduction of property uncertainty. As categories become smaller and
more refined, the differences between categories become smaller, making the
task of categorization more difficult and less reliable. For example, to
determine that an object is a Siberian Husky generally requires more property
information  than  to  determine  that  it  is  a  dog.  Furthermore,  the  categoriza¬ 
tion  of  novel  objects  becomes  less  reliable  since  different  categories  are  now 
more  similar  to  each  other;  deciding  whether  some  new  object  is  a  Husky  or 
a  German  Shepherd  is  more  difficult  than  deciding  whether  it  is  a  Dog  or  a 
Horse.  Thus,  increasing  the  depth  of  categorization  facilitates  some  goals  of 
the  observer  while  hindering  others. 

4.2.2  Minimizing  category  uncertainty 

Now,  let  us  consider  climbing  the  taxonomy  tree,  with  the  categorizations 
becoming  coarser  as  we  move  from  the  finest  categories  to  the  root  node 
THING.  Now  the  categories  become  more  general:  knowledge  of  the  category 
to  which  an  object  belongs  provides  less  information  about  the  properties  of 
the  object  as  we  get  closer  to  the  root  node.  At  the  extreme,  where  there 
is  only  the  one  category  THING,  knowledge  of  an  object’s  category  permits 
almost  no  predictions  about  any  of  its  properties.  Therefore,  decreasing  the 
depth  of  categorization  decreases  the  ability  of  the  observer  to  satisfy  his 
goal  of  being  able  to  make  important  predictions  about  objects  based  upon 
their  categorization. 

As is to be expected, the sacrifice of the ability to make predictions about
the properties of objects is accompanied by a compensatory increase in the
ease of categorizing an object. In general, less property information is
required to know that an object is a dog than to know that it is a Siberian
Husky.2 In the case of the depth zero categorization, where the only category
is THING, minimal information is required to make the correct classification.3
Likewise, the ability to categorize novel objects also improves with
decreasing categorization depth since the categories become more encompassing.
Climbing up the taxonomy tree reduces the category uncertainty, which we
denote as $U_c(Z)$. As with increasing categorization depth, decreasing depth
facilitates some of the recognition goals of the observer and hinders others.

To further refine our definition of $U_c$, we need to make the
categorization process more explicit. Let us assume that the process of categorizing
an  object  is  performed  by  looking  at  the  current  categorization  of  objects 
and  finding  the  category  whose  objects  “match  best”  —  an  operation  which 
we  will  currently  leave  undefined  —  the  object  in  question  in  their  observed 
properties.  If  given  a  complete  description  of  an  object,  and  if  that  object 
matches  only  objects  in  one  category,  then  there  is  no  uncertainty  in  the 
categorization  process.  However,  in  perception  it  is  often  the  case  that  many 
of  the  potentially  observable  properties  are  not  provided  in  an  object’s  de¬ 
scription  or  that  an  object  matches  no  object  in  the  current  categorization 
or  that  it  matches  objects  in  several  categories.  Therefore  let  us  loosely  de¬ 
fine  the  category  uncertainty  as  the  uncertainty  of  the  category  of  an  object 
given some of its observed properties. This definition also accounts for
objects that don’t match any previous object since they presumably do match
other objects in some of their properties. In later sections where we derive
a particular evaluation function, we will make more precise the idea of some
observed properties.

2 Note there may exist some unique identifying property which will indicate membership in
some low level category. For example, if one knows that an object has one blue eye and
one brown eye, then there is a high probability that the object is a Siberian Husky. Thus,
for that particular property, identifying an object as a dog is no easier than identifying it
as some particular type of dog. However, two points help eliminate this concern. First,
by definition, any property which helps to categorize an object as a Siberian Husky also
helps to categorize that object as a dog. Therefore determining an object is a dog can
be no more difficult than determining it is a Husky. Second, if we assume the difficulty
of categorization is measured not only by the number of properties required to categorize
an object but also by how restricted those properties must be, then the existence of some
unique identifying feature does not make the Husky categorization easier. Later in this
chapter we define a formal measure of the uncertainty in categorizing an object that is
consistent with this assumption.

3 We say “minimal” information as opposed to none because some information might be
required just to know something is a “Thing.” For example, is sand in a sandbox a thing?
This problem cannot be resolved without defining what constitutes an object.
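
The loose definition of category uncertainty can be made concrete with a small
sketch. It assumes exact matching on the supplied properties as a stand-in for
the undefined “match best” operation, and it measures uncertainty as the
entropy of the category label among the matching stored objects; the data and
names are invented for illustration.

```python
import math
from collections import Counter
from typing import Dict, List

Obj = Dict[str, str]


def category_entropy(categorization: Dict[str, List[Obj]], partial: Obj) -> float:
    """Entropy (bits) of the category label among objects matching `partial`."""
    counts = Counter()
    for label, members in categorization.items():
        counts[label] = sum(
            all(o.get(f) == v for f, v in partial.items()) for o in members
        )
    n = sum(counts.values())
    if n == 0:
        # Simplification: partial matching of unmatched objects is not modeled.
        raise ValueError("no stored object matches the partial description")
    return -sum((c / n) * math.log2(c / n) for c in counts.values() if c)


huskies = [{"legs": "4", "eyes": "odd"}, {"legs": "4", "eyes": "brown"}]
shepherds = [{"legs": "4", "eyes": "brown"}, {"legs": "4", "eyes": "brown"}]
horses = [{"legs": "4", "eyes": "brown"}, {"legs": "4", "eyes": "brown"}]

fine = {"husky": huskies, "shepherd": shepherds, "horse": horses}
coarse = {"dog": huskies + shepherds, "horse": horses}
```

Given only “four legs”, the coarse categorization leaves less uncertainty than
the fine one. Given the odd-eyes property, both leave none, which mirrors the
point of footnote 2: a property that identifies a Husky identifies a dog at
least as well.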


4.2.3  Uncertainty  of  a  categorization 

In  the  previous  two  sections  we  noted  that  as  a  categorization  becomes  either 
finer  or  coarser,  some  of  the  goals  of  the  observer  are  made  more  difficult  to 
achieve  while  others  are  made  easier.  Therefore,  if  the  observer  requires  some 
degree  of  success  in  all  his  goals,  then  the  appropriate  level  of  categorization 
must  lie  somewhere  between  the  two  extreme  granularities  of  categorization. 
However, the question as to what level is appropriate cannot be answered
until  the  desired  relative  achievement  of  the  conflicting  goals  of  the  observer 
is  specified. 

For example, consider an organism that has both simple perceptual needs
(the properties he needs to extract about objects are few and quite crude)
and a primitive sensory apparatus (the extraction of complicated information
is quite difficult and time consuming). Such an organism would desire
a set of categories relatively near to the top of an object taxonomy. Choosing
such a set corresponds to sacrificing the ability to make precise predictions
about the properties of objects in exchange for reliable and time-effective
categorization. Conversely, an organism with great perceptual demands and
refined  sensory  mechanisms  (e.g.  primates)  would  make  the  opposite  choice: 
a  set  of  categories  that  required  encoding  more  sensory  information  but  af¬ 
forded  more  precise  predictions. 

Let  us  propose  a  categorization  evaluation  function  that  makes  explicit 
the  trade-off  between  these  two  conflicting  goals  of  the  observer.  We  assume 
we  have  a  candidate  categorization  —  a  partitioning  of  the  set  of  objects 
into  a  set  of  categories  —  and  that  our  task  is  to  evaluate  how  well  the 
categorization  satisfies  the  goals  of  the  observer.  We  require  an  evaluation 
function that combines $U_p$ and $U_c$ in such a manner as to make explicit
the trade-off between the two uncertainties. Let us introduce the parameter
$\lambda$ to represent that trade-off, and let $U(U_p, U_c, \lambda)$ be the
total uncertainty of a categorization, where $0 \le \lambda \le 1$. We will
view U as a measure of poorness of a categorization; the less total
uncertainty a categorization has the more it is to be preferred. The parameter
$\lambda$ is to be interpreted as a relative weight
between being able to infer an object’s properties from its category and
being able to infer an object’s category from its properties. When $\lambda = 0$
only the ability to infer properties is considered; thus the best categorization
is that which is the finest. Likewise when $\lambda = 1$, only the ability to infer the
category is important; in this case the coarsest categorization is preferred.
The important questions that arise are what are the preferred categorizations
as $\lambda$ takes on intermediate values and how does the setting of $\lambda$ interact
with the actual classes present in the world. We will have to postpone the
discussion of these issues until after we derive suitable measures for $U_p$ and
$U_c$.
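
The functional form of the total uncertainty has not been given at this point.
The sketch below therefore assumes a simple linear combination, chosen only
because it reproduces the two limiting cases described above; the uncertainty
values attached to each level are invented so that property uncertainty falls
and category uncertainty rises with depth.

```python
from typing import Dict, Tuple

# Hypothetical (Up, Uc) pairs for three levels of one taxonomy; the numbers are
# invented so that Up falls and Uc rises with depth, as the text describes.
LEVELS: Dict[str, Tuple[float, float]] = {
    "coarsest": (2.5, 0.0),
    "middle": (0.5, 0.6),
    "finest": (0.0, 2.0),
}


def total_uncertainty(up: float, uc: float, lam: float) -> float:
    """Assumed linear trade-off: lam = 0 weighs only Up, lam = 1 only Uc."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (1.0 - lam) * up + lam * uc


def preferred_level(lam: float) -> str:
    """Level with the least total uncertainty for a given lam."""
    return min(LEVELS, key=lambda name: total_uncertainty(*LEVELS[name], lam))
```

Under these assumptions the finest level is preferred at the lower extreme of
the parameter, the coarsest at the upper extreme, and the middle level at 0.5.
Which level wins at intermediate settings depends entirely on the uncertainty
values, which is why the discussion must wait for actual measures.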

4.3  Measuring  Uncertainty 

4.3.1  Property  based  representation 

The  observer  does  not  directly  categorize  the  objects  in  the  world.  Rather, 
he can only operate on a representation of those objects. We define a
representation to be a mapping from the set of all possible objects, $\Theta$, to some
finite set Q*.4 Note that even though we required that each object $\theta_i$ be a
member of only one category (the categorization is a partition in the
mathematical sense) two distinct objects may have the same description in the
representation used by the observer. The representation of object $\theta_i$ may be
identical to the representation of object $\theta_j$, but since it is a different object,
it is permitted to be in a different category. Of course, if one is
proposing that the categories of some categorization correspond to natural mode
classes,  then  this  situation  would  either  be  a  violation  of  the  Principle  of  Nat¬ 
ural  Modes  or  simply  a  representation  insensitive  to  the  differences  between 
classes.5  However,  as  a  potential  categorization  it  is  certainly  permissible. 
Furthermore,  a  single  category  may  have  many  objects  with  the  same  de¬ 
scription,  which  corresponds  to  the  situation  where  the  representation  does 
not  discriminate  between  two  objects  assigned  to  the  same  category.6 

4 The finite restriction is included to agree with the intuition that there is some limit to
information encoded by the observer.

5 In chapter 5 we will further consider the competence of a representation.

6 A problem with allowing distinct objects to have identical descriptions is that it becomes
impossible to distinguish between the case of two different objects being so similar that
they map to the same point in the representation space and the case of two instances of

For our derivation of the quantities $U_p$ and $U_c$ we will utilize a property
based representation. Though commonly referred to as a feature space
representation [Duda and Hart, 1973], we prefer the term property description to
emphasize  the  fact  that  these  properties  are  of  the  objects  themselves,  not 
of  an  image  or  some  other  sensory  representation.  The  term  “feature”  will 
be  used,  but  to  refer  to  a  predicate  defined  on  objects  and  computable  from 
sensory  information.  We  should  note  that  the  form  of  the  representation  is 
not  critical  to  the  qualitative  results  derived  about  the  evaluation  function. 
If  one  prefers  some  other  representational  form,  for  example  the  volumetric 
primitive  approach  of  generalized  cylinders,  then  such  a  representation  may 
be used as long as a method for computing $U_p$ and $U_c$ is also specified.

The terminology of our property based representation is defined as
follows: the term feature refers to a function or predicate computed about an
object; the term value, to the value taken by a feature; the term property,
to the valued feature. For example, “length” is a feature, “6 ft.” is a value,
and “having length 6 ft.” is a property. Each feature, $f_i$, $1 \le i \le m$, has
an associated set of values $\{v_{i1}, v_{i2}, \ldots\}$, referred to as the range of the
feature. We require that the range be a finite set but the cardinality of the
range can vary from one feature to the next. F denotes the set of features
$\{f_1, f_2, \ldots, f_m\}$. Using these features, each object $\theta$ is represented by an
m-dimensional property vector $P = (v_{1\alpha}, v_{2\beta}, \ldots, v_{m\gamma})$, where $v_{ij}$ is
the jth value of the range of the ith feature.
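
A minimal sketch of such a property based representation follows. The feature
names beyond “length”, the particular ranges, and the function names are
hypothetical; the point is only that finite ranges give a finite representation
in which distinct objects can share a property vector.

```python
from typing import Dict, List, Tuple

# Hypothetical features with finite ranges; the only concrete example in the
# text is the feature "length" with the value "6 ft."
RANGES: Dict[str, List[str]] = {
    "length": ["3 ft.", "6 ft."],
    "color": ["red", "blue", "green"],
}
FEATURES: List[str] = list(RANGES)  # F = {f_1, ..., f_m}, here m = 2


def property_vector(obj: Dict[str, str]) -> Tuple[str, ...]:
    """Map an object description to its m-dimensional property vector."""
    vector = []
    for feature in FEATURES:
        value = obj[feature]
        if value not in RANGES[feature]:
            raise ValueError(f"{value!r} is outside the range of {feature!r}")
        vector.append(value)
    return tuple(vector)


def representation_size() -> int:
    """Number of distinct property vectors; finite because every range is finite."""
    size = 1
    for values in RANGES.values():
        size *= len(values)
    return size


# Two distinct objects that the representation cannot tell apart.
plank_a = {"length": "6 ft.", "color": "red", "owner": "first"}
plank_b = {"length": "6 ft.", "color": "red", "owner": "second"}
```

Here `plank_a` and `plank_b` are different objects with the same property
vector, because "owner" is not one of the features the observer computes.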

As  defined  at  the  start  of  this  section,  a  categorization  is  a  partitioning  of 
the  population  of  objects,  with  each  equivalence  class  defined  by  the  partition 
being  referred  to  as  a  category.  The  symbol  Z  will  continue  to  be  used  to 
represent  some  possible  categorization;  often,  however,  the  operations  being 
discussed  will  only  be  meaningful  with  respect  to  some  categorization  and 
the  explicit  use  of  Z  will  be  omitted.  In  the  sections  that  follow  we  let  c  be 
the number of categories in a categorization, and let $C_i$ be the $i$th category.
Also, we need a category function $\Psi$, which maps an object onto its category
in the current categorization: $\Psi(\theta_k)$ is the category to which the object
$\theta_k$ belongs. We denote $\Psi(\theta_k)$ as $\psi_k$. The size of a category is expressed by
$\|C_i\|$ or by $\|\psi_k\|$, depending on whether we refer to the $i$th category or the
category to which object $\theta_k$ belongs.

Finally, when we need to refer to the structure of objects in the world,
we will need to refer to the natural classes present. Recall that a class
is distinguished from a category in that a class represents structure in the
physical world, whereas a category is part of a categorization proposed by
the observer. We will use the symbol $\Omega_j$ to represent the $j$th class.

the same object. For now, we assume that somehow we know that each object is distinct.

4.3.2  Information  theory  and  entropy 

In  our  discussion  about  the  ability  to  make  inferences  about  the  properties 
of  objects,  we  have  been  using  the  term  uncertainty  without  having  provided 
a  suitable  definition.  If  we  are  to  propose  a  measure  of  the  utility  of  a 
categorization  based  upon  uncertainty  of  inference,  we  must  have  a  formal 
definition  of  uncertainty  consistent  with  the  representation  of  objects  and 
categories. 

In  information  theory,  uncertainty  is  the  amount  of  information  which 
is  unknown  about  some  signal  [McEliece,  1977].  It  is  measured  in  terms 
of  the  probabilities  of  the  signal  being  in  each  of  its  possible  states.  For 
example  if  some  signal  A  can  be  in  one  of  two  states,  each  with  probability 
.5,  and  signal  B  can  be  in  one  of  4  states  each  with  probability  .25,  then 
there  is  said  to  be  more  uncertainty  about  signal  B,  and  signal  B  is  said 
to  convey  more  information.  Shannon,  in  his  original  work  on  information 
theory  [Shannon  and  Weaver,  1949],  derived  an  information  measure  H  based 
upon  the  entropy  of  a  probability  distribution: 

$$H = -\sum_{i=1}^{n} p_i \log p_i \qquad (4.1)$$

where $p_i \ge 0$ and $\sum_{i=1}^{n} p_i = 1$. One of the elegant results of that work
was the demonstration that any measure of uncertainty must use a $p \log p$
formulation  if  it  is  to  satisfy  several  desirable  and  intuitive  constraints  about 
information  and  communication.  As  such,  entropy  has  become  the  standard 
means  of  measuring  uncertainty  [McEliece,  1977]. 

The  question  we  need  to  consider  is  whether  it  is  appropriate  to  consider 
the  uncertainty  in  the  perceptual  process  to  be  similar  to  uncertainty  in 
the  theory  of  communication.  If  so,  then  entropy  is  a  natural  measure  in 
which  to  express  uncertainty.  Perhaps  the  simplest  answer  to  this  question 
is  that  perception  is  communication.  We  can  view  the  perceptual  process 
as  communication  between  what  is  being  observed  and  the  observer.  The 
channel consists of the sensory apparatus; the coded message, the sensory
input. It is the task of the observer to decode the actual message from sensory
input.  As  such  we  claim  that  the  traditional  measure  of  uncertainty  in 
communication  theory  is  an  appropriate  measure  of  perceptual  uncertainty.7 

Also,  the  particular  form  of  the  uncertainty  measure  is  not  critical  to 
the  work  described  here.  In  fact,  an  implementation  not  reported  in  this 
document  made  use  of  a  measure  based  on  the  probability  of  making  an 
error  if  the  observer  made  his  best  guess  about  an  object’s  category.  The  re¬ 
sults  of  that  implementation  were  similar  to  those  achieved  with  the  entropy 
measure. 

When  deriving  the  uncertainty  measures  of  the  next  sections  it  will  be 
useful  to  keep  in  mind  three  properties  of  the  entropy  measure  H  that  are 
consistent with one’s intuition about measuring uncertainty. First, $H = 0$
if there is only one possible state, i.e. $p_i = 1$ and $p_j = 0$ for all $j \ne i$.
Thus, when only one alternative exists (say, about the category of an object)
the uncertainty measure equals zero. Second, for the case when all the
probabilities are equal, $p_i = p_j$ for all $i$ and $j$, $H$ increases as the number
of choices increases. The greater the number of alternatives, the greater the
uncertainty. Finally, for a fixed number of alternatives $m$, $H$ is a maximum
when all of the probabilities are equal, and that maximum value of $H$ is
$\log m$. Uncertainty is the greatest when one has no reason to prefer one
alternative  over  another. 
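The three properties just listed can be checked numerically. The following is a minimal sketch; the function names and the example distributions are our own illustrative choices.

```python
import math


def entropy(dist):
    """Shannon entropy in bits of a finite distribution.

    Terms with p = 0 are skipped, which corresponds to taking
    p log p = 0 in the limit p -> 0.
    """
    return sum(-p * math.log2(p) for p in dist if p > 0)


def uniform(k):
    """Uniform distribution over k alternatives."""
    return [1.0 / k] * k


h_single = entropy([1.0, 0.0, 0.0])        # only one possible state
h_two = entropy(uniform(2))                # two equally likely states
h_four = entropy(uniform(4))               # four equally likely states
h_skewed = entropy([0.7, 0.1, 0.1, 0.1])   # four states, unequal probabilities
```

Here `h_single` is zero, `h_four` exceeds `h_two`, and `h_skewed` is smaller than `h_four`, whose value equals `math.log2(4)`.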

4.3.3  Measuring  Up 

In  this  section  we  will  derive  an  entropy  measure  for  the  property  uncertainty 
Up.  We  proceed  by  assuming  that  the  observer  knows  that  an  object  belongs 
to some particular category $C_i$. The question we want to answer is how much
uncertainty  does  he  have  about  the  object’s  properties? 

There  are  two  ways  to  think  about  the  properties  of  objects.  The  first 
is  to  consider  the  property  vector  as  a  whole,  and  the  uncertainty  of  the 
properties  of  an  object  is  the  uncertainty  of  the  entire  property  vector.  The 
second  is  to  consider  each  component  independently.  To  decide  which  way 
is  appropriate  for  measuring  Up,  we  must  consider  the  tasks  of  the  observer 
for  which  the  property  information  is  useful. 


Figure  4.3:  Two  categories  with  their  objects  and  associated  property  vectors. 
If  the  components  of  the  property  vector  are  considered  independently  then  the 
uncertainty of $C_b$ is greater than that of $C_a$; otherwise they are equal.


One  such  task  is  simply  needing  to  know  some  particular  property.  For 
example,  the  fact  that  something  is  hard  (therefore  can  be  stepped  on  safely) 
or  that  something  moves  (therefore  should  be  kept  at  a  safe  distance)  are 
properties  the  observer  might  want  to  know  directly.  Another  task,  and  one 
which  may  be  critical  for  a  reliable  perceptual  process,  is  the  ability  to  make 
predictions  about  as  yet  unobserved  but  potentially  observable  properties; 
these  predictions  are  necessary  for  the  verification  of  the  identity  of  an  object. 
In  both  of  these  cases,  it  is  the  separate  uncertainties  of  the  components  of 
the  property  vector  that  are  important.  Figure  4.3  illustrates  this  point.  For 
the  perceptual  goals  of  the  observer,  knowing  that  some  object  is  a  member 
of  category  Ca  provides  him  with  more  useful  information  than  knowing  that 
some object is a member of Cb. However, for the uncertainty of the properties
of an object given its category to be greater for Cb than for Ca, we must
consider the properties independently. If the property
vectors are considered in their entirety, then the property uncertainties of
Ca and Cb would be equal.
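The distinction between the two views can be illustrated with a small sketch. The two categories below are hypothetical stand-ins (the actual contents of Figure 4.3 are not reproduced here): each contains two distinct property vectors, but in the second category both components vary.

```python
import math
from collections import Counter


def entropy_of_counts(counts):
    """Entropy in bits of the distribution obtained by normalizing counts."""
    total = sum(counts)
    return sum(-(k / total) * math.log2(k / total) for k in counts if k > 0)


def joint_uncertainty(vectors):
    """Uncertainty of the property vector taken as a whole."""
    return entropy_of_counts(Counter(vectors).values())


def per_feature_uncertainty(vectors):
    """Sum of the uncertainties of the components taken independently."""
    m = len(vectors[0])
    return sum(
        entropy_of_counts(Counter(v[i] for v in vectors).values())
        for i in range(m)
    )


# Hypothetical categories, each holding two distinct property vectors.
C_a = [(0, 0), (0, 1)]   # only the second component varies
C_b = [(0, 0), (1, 1)]   # both components vary
```

Taken as whole vectors, `C_a` and `C_b` are equally uncertain (one bit each); taken component by component, `C_b` carries two bits of uncertainty against one for `C_a`.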

We  now  construct  a  measure  of  property  uncertainty  considering  the 
uncertainty of each of the features independently. First, we need to define
the  uncertainty  of  a  feature  in  a  category.  To  reduce  the  complexity  of 
the  notation,  we  define  H(D)  to  be  the  entropy  of  any  finite  probability 
distribution  D: 

$$H(D) = -\sum_{i=1}^{n} p_i \log_2 p_i \qquad (4.2)$$

where $D = \{p_1, p_2, \ldots, p_n\}$, $p_i \ge 0$, and $\sum_{i=1}^{n} p_i = 1$. For the remainder of
the  thesis  the  base  of  the  logarithm  will  be  omitted  from  the  expressions;  we 
will  always  assume  it  to  be  2. 

Now let us define the distribution of a feature $f_i$ in some category $C_a$.
Let $p_{ia}^{j}$ be the fraction of objects in $C_a$ whose value for feature $f_i$ is
equal to the $j$th value in the range of $f_i$.8 Then dist($f_i$) in $C_a$ is the set
$\{p_{ia}^{1}, p_{ia}^{2}, \ldots, p_{ia}^{\rho}\}$, where $\rho$ is the number of values in the range of $f_i$. Using
this distribution we define the uncertainty of feature $f_i$ in category $C_a$ to be
$H(\mathrm{dist}(f_i) \text{ in } C_a)$.

Having  defined  the  uncertainty  of  a  feature  in  a  category  we  can  define  our 
property  uncertainty  of  the  category  as  the  sum  of  the  feature  uncertainties: 

$$U_{p\text{-of-}c}(C_a) = \sum_{f_i \in F} H(\mathrm{dist}(f_i) \text{ in } C_a) \qquad (4.3)$$

The above equation provides a measure to answer the question of how much
uncertainty about an object’s properties remains once that object’s category
is known. To compute $U_p(Z)$ we must extend that measure to provide an
evaluation of the property uncertainty over the entire categorization. Let $n$
be the total number of objects, $n = \sum_i \|C_i\|$. Recalling that $\Psi(\theta_i)$ represents
the category to which object $\theta_i$ belongs, we define $U_p$ as the average of
$U_{p\text{-of-}c}$ as summed over all the objects in the categorization $Z$:

$$U_p(Z) = \frac{1}{n} \sum_{\theta_i \text{ in } Z} U_{p\text{-of-}c}(\Psi(\theta_i)) \qquad (4.4)$$

Since $U_{p\text{-of-}c}$ is only a function of $\Psi(\theta_i)$ and not of $\theta_i$ itself, we can sum over
the categories instead of the objects, weighting each category according to
its size:

$$U_p(Z) = \frac{1}{n} \sum_{C_i \in Z} \|C_i\| \, U_{p\text{-of-}c}(C_i) \qquad (4.5)$$

8 We do not have to exclude the case where $p = 0$ because, by L’Hospital’s rule,
$\lim_{p \to 0} p \log p = 0$.

This  second  form  is  computationally  less  intensive  and  is  the  form  used  in  the 
implementations  discussed  later  in  this  chapter.  We  postpone  discussion  of 
how  Up  behaves  in  ideal,  noise,  and  real  conditions  until  after  we  derive  Uc 
and  can  apply  a  total  uncertainty  function  to  both  artificial  and  real  data. 
Later,  we  shall  also  discuss  how  Up  compares  with  some  of  the  distance 
metrics  discussed  in  chapter  3. 
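As a hedged illustration of equations 4.3 through 4.5, the sketch below computes the property uncertainty both by summing over objects and by summing over categories. It assumes a categorization is represented as a list of categories, each a list of property vectors; that representation, the function names, and the sample data are our own choices.

```python
import math
from collections import Counter


def entropy_of_counts(counts):
    """Entropy in bits of the distribution obtained by normalizing counts."""
    total = sum(counts)
    return sum(-(k / total) * math.log2(k / total) for k in counts if k > 0)


def up_of_c(category):
    """Eq. 4.3: sum over features of the entropy of each feature's values."""
    m = len(category[0])
    return sum(
        entropy_of_counts(Counter(v[i] for v in category).values())
        for i in range(m)
    )


def up_by_objects(Z):
    """Eq. 4.4: average over objects of the uncertainty of their category."""
    n = sum(len(c) for c in Z)
    return sum(up_of_c(c) for c in Z for _ in c) / n


def up_by_categories(Z):
    """Eq. 4.5: the same average, weighting each category by its size."""
    n = sum(len(c) for c in Z)
    return sum(len(c) * up_of_c(c) for c in Z) / n


# Illustrative categorization with two categories and two features.
Z = [
    [(0, 0), (0, 1), (0, 1), (0, 0)],  # second feature varies: 1 bit
    [(1, 2), (1, 2)],                  # no variation: 0 bits
]
```

Both forms give the same value, (4 · 1 + 2 · 0) / 6 = 2/3 for this data, but the second evaluates `up_of_c` once per category rather than once per object.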


4.3.4  Measuring  Uc 

Having  proposed  a  measure  for  Up  we  must  now  provide  a  measure  for  Uc, 
the  category  uncertainty.  In  section  4.2.3,  we  stated  that  Uc  was  the  uncer¬ 
tainty  of  the  categorization  of  an  object  given  some  of  the  object’s  properties; 
we  must  now  make  that  loose  description  precise. 

To  begin,  let  us  assume  that  “some”  of  the  observed  properties  means 
exactly  what  it  says:  we  are  given  only  some  of  the  components  of  the 
property  vector  describing  some  object.  This  situation  would  arise  if  some  of 
the  (potentially)  observable  properties  could  not  be  recovered  in  the  current 
sensing  situation.  Consider  the  uncertainty  of  categorization  if  we  are  given 
this  incomplete  description  of  the  object  and  our  task  is  to  decide  to  which 
category  that  object  belongs.  To  determine  the  correct  category,  the  observer 
would  check  each  category  in  turn,  noticing  whether  there  are  objects  whose 
property vector matches the components that are provided for the object
in  question.  If  only  one  category  contains  any  objects  that  match,  then 
there  is  no  uncertainty  of  categorization.  If,  however,  there  is  more  than 
one  category,  we  need  some  way  of  measuring  the  uncertainty  as  to  which 
category  the  object  belongs.9 

We  will  design  a  measure  of  category  uncertainty  by  assuming  that  the 
percentage  of  matches  that  a  partial  description  of  an  object  makes  to  a 
category  is  representative  of  the  probability  that  the  object  actually  belongs 
to that category. For example, suppose a given partial description of object
$\theta_k$ matches 4 objects in category $C_a$, 12 objects in category $C_b$, and no
objects in any remaining category. Then we say the probability that $\theta_k$
belongs in $C_a$ is .25, and in $C_b$ is .75. This suggests that we measure the
uncertainty of categorization for an object with the entropy function $H$.

9 We do not need to consider the case of an object not matching any of the objects in the
categories. The uncertainty measure is designed for the evaluation of a categorization in
which all objects of the population have been categorized. Thus every object is guaranteed
to match at least one object, namely itself.

Let $F'$ be some subset of the set of features $F$. We define $\mathrm{Match}(\theta_k, C_a, F')$
to be the number of objects in category $C_a$ whose property vectors match
that of $\theta_k$ in the components contained in $F'$. As such we define the match
probability $p_M(\theta_k, C_a, F')$ of $\theta_k$ in $C_a$ on $F'$:

$$p_M(\theta_k, C_a, F') = \frac{\mathrm{Match}(\theta_k, C_a, F')}{\sum_{C_i} \mathrm{Match}(\theta_k, C_i, F')} \qquad (4.6)$$

where the denominator is simply the sum of the matches over all the categories.
Given the match probabilities, we define the match distribution
of $(\theta_k, F')$ to be the set of probabilities $\{p_M(\theta_k, C_1, F'), p_M(\theta_k, C_2, F'), \ldots,
p_M(\theta_k, C_c, F')\}$. Finally, we can define the category uncertainty for a given
object with a given feature subset description:

$$U_{c\text{-of-}\theta}(\theta_i, F') = H(\text{match distribution of } (\theta_i, F')) \qquad (4.7)$$

If an object $\theta_i$ matches only objects in one category in the features of $F'$ then
the uncertainty $U_{c\text{-of-}\theta}$ will be zero.
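A minimal sketch of equations 4.6 and 4.7 follows. It assumes property vectors are tuples and a feature subset is a tuple of component indices; the data are hypothetical and were chosen so that a one-feature description reproduces the counts of the example above (4 matches in one category, 12 in another, none elsewhere).

```python
import math


def entropy(dist):
    """Shannon entropy in bits; terms with p = 0 contribute nothing."""
    return sum(-p * math.log2(p) for p in dist if p > 0)


def match(theta, category, subset):
    """Match(theta, C, F'): objects in C agreeing with theta on every index in F'."""
    return sum(all(obj[i] == theta[i] for i in subset) for obj in category)


def match_distribution(theta, Z, subset):
    """Eq. 4.6 evaluated for every category of the categorization Z."""
    counts = [match(theta, c, subset) for c in Z]
    total = sum(counts)  # never zero: theta always matches itself
    return [k / total for k in counts]


def uc_of_theta(theta, Z, subset):
    """Eq. 4.7: entropy of the match distribution of (theta, F')."""
    return entropy(match_distribution(theta, Z, subset))


# Hypothetical categorization; component 0 is the observed feature.
C_a = [("x", 0)] * 4 + [("y", 0)] * 2
C_b = [("x", 1)] * 12
C_c = [("z", 2)] * 3
Z = [C_a, C_b, C_c]
theta = C_a[0]

partial = match_distribution(theta, Z, (0,))     # only feature 0 observed
u_partial = uc_of_theta(theta, Z, (0,))
u_full = uc_of_theta(theta, Z, (0, 1))           # both features observed
```

With only the first feature observed the match distribution is (.25, .75, 0) and the uncertainty is about 0.81 bits; with both features observed the object matches only members of its own category and the uncertainty is zero.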

Having  defined  the  category  uncertainty  for  one  object  over  one  subset  of 
the  features,  we  can  compute  the  category  uncertainty  Uc  for  a  categorization 
Z  by  averaging  over  all  objects  and  over  all  possible  subsets  of  F.  However, 
to  compute  such  an  average  we  must  take  into  account  the  probability  of 
having a particular feature subset $F'$ available for a particular object $\theta_i$. For
a given object one set of properties may be highly salient and thus likely to
be viewed, while for another object a different set of properties may be more
likely to be available. Thus, we define the quantity $p_S(F', \theta_i)$ to be the salience
probability, where $p_S(F', \theta_i) \ge 0$ and $\sum_{F'} p_S(F', \theta_i) = 1$ for each object $\theta_i$.
This probability is
intended to reflect the likelihood of having some particular subset of features
(and only that subset) available for a given object. Let $\wp(F)$ be the power
set of $F$, the set of all subsets of $F$. Then, using the salience probability as
the appropriate weight for averaging the individual category uncertainties,
we get the following expression for $U_c(Z)$:

$$U_c(Z) = \frac{1}{n} \sum_{\theta_i \text{ in } Z} \sum_{F' \in \wp(F)} p_S(F', \theta_i)\, H(\text{match distribution of } (\theta_i, F')) \qquad (4.8)$$

A  special  case  of  the  above  equation  occurs  if  one  assumes  that  the  salience 
probability  is  equal  for  all  feature  subsets  and  for  all  objects.  In  this  case, 
since the cardinality of $\wp(F)$ equals $2^{\|F\|}$, $U_c(Z)$ reduces to:

$$U_c(Z) = \frac{1}{n\,2^{\|F\|}} \sum_{\theta_i \text{ in } Z} \sum_{F' \in \wp(F)} H(\text{match distribution of } (\theta_i, F')) \qquad (4.9)$$

The  above  equation  is  used  in  the  implementation  discussed  later  in  this 
chapter  and  in  subsequent  chapters.  This  special  case  was  employed  instead 
of  the  more  general  formulation  because  without  a  model  of  the  sensing 
apparatus  we  have  no  basis  for  assigning  the  salience  probabilities. 
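The equal-salience special case can be sketched as an exhaustive average over the power set. This is an illustrative reading of equation 4.9 rather than the implementation referred to in the text; the list-of-tuples representation and all names are assumptions. Note that the power set includes the empty subset, for which every object matches every other object.

```python
import math
from itertools import chain, combinations


def entropy(dist):
    """Shannon entropy in bits; terms with p = 0 contribute nothing."""
    return sum(-p * math.log2(p) for p in dist if p > 0)


def match(theta, category, subset):
    """Objects in the category agreeing with theta on every index in subset."""
    return sum(all(obj[i] == theta[i] for i in subset) for obj in category)


def uc_of_theta(theta, Z, subset):
    """Eq. 4.7: entropy of the match distribution of (theta, subset)."""
    counts = [match(theta, c, subset) for c in Z]
    total = sum(counts)
    return entropy(k / total for k in counts)


def power_set(m):
    """All 2**m subsets of the feature indices 0..m-1, empty subset included."""
    return list(chain.from_iterable(combinations(range(m), r) for r in range(m + 1)))


def uc(Z):
    """Eq. 4.9: average of uc_of_theta over all objects and all feature subsets."""
    n = sum(len(c) for c in Z)
    subsets = power_set(len(Z[0][0]))
    total = sum(uc_of_theta(theta, Z, s) for c in Z for theta in c for s in subsets)
    return total / (n * len(subsets))


Z_mixed = [[(0, 0), (0, 0)], [(0, 0), (0, 0)]]      # identical objects split in two
Z_separated = [[(0, 0), (0, 0)], [(1, 1), (1, 1)]]  # no shared property values
```

For `Z_mixed` every subset leaves one bit of uncertainty, so the average is 1. For `Z_separated` only the empty subset is uninformative, so the average is 1/4.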

As  a  final  comment  about  the  computation  of  Uc,  we  note  that  the  size 
of the power set $\wp(F)$ grows exponentially as the size of $F$ increases. As
$\|F\|$ becomes only moderately large (only 15 or so), $2^{\|F\|}$ becomes computationally
unmanageable, since each of the subsets would be evaluated for
each  object.  To  alleviate  this  problem,  an  algorithm  was  implemented  in 
which  not  all  possible  subsets  of  F  are  considered  for  each  object.  Rather, 
for  each  different  object  a  different  set  of  subsets  of  features  is  randomly 
chosen  for  the  computation.  The  number  of  feature  subsets  used  per  object 
can  be  varied,  trading  speed  for  accuracy.  In  the  examples  shown  in  this 
thesis,  the  sampling  method  was  used  exclusively.  A  comparison  made  be¬ 
tween  the  sampling  method  and  the  exhaustive  method  yielded  no  significant 
differences. 
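The sampling strategy can be sketched as follows. The text does not say how the random subsets were drawn; the sketch assumes each feature is included independently with probability 1/2, which draws uniformly from the power set. The function names and sample data are hypothetical.

```python
import math
import random


def entropy(dist):
    """Shannon entropy in bits; terms with p = 0 contribute nothing."""
    return sum(-p * math.log2(p) for p in dist if p > 0)


def uc_of_theta(theta, Z, subset):
    """Entropy of the match distribution of theta on the given feature indices."""
    counts = [
        sum(all(obj[i] == theta[i] for i in subset) for obj in category)
        for category in Z
    ]
    total = sum(counts)
    return entropy(k / total for k in counts)


def uc_sampled(Z, subsets_per_object, rng):
    """Estimate Uc by drawing a fresh set of random feature subsets per object."""
    m = len(Z[0][0])
    total, count = 0.0, 0
    for category in Z:
        for theta in category:
            for _ in range(subsets_per_object):
                # Assumption: independent coin flips give a uniform draw
                # from the 2**m subsets of the features.
                subset = [i for i in range(m) if rng.random() < 0.5]
                total += uc_of_theta(theta, Z, subset)
                count += 1
    return total / count


rng = random.Random(0)  # fixed seed so the sketch is repeatable
Z = [
    [(0, 0, 0), (0, 0, 1)],
    [(1, 1, 1), (1, 1, 0)],
]
estimate = uc_sampled(Z, subsets_per_object=8, rng=rng)
```

Raising `subsets_per_object` brings the estimate closer to the exhaustive average at proportionally greater cost, which is the trade of speed for accuracy described above.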

We  should  note  that  the  strategy  of  sampling  the  feature  subsets  is  only 
valid  when  most  features  are  constrained  by  the  natural  classes.  Otherwise, 
there  is  a  high  probability  that  the  sampled  subsets  will  contain  no  useful 
information about the category to which an object belongs, and the computation
of Uc will produce erroneous results. If we require that such a sampling
strategy be available to the observer, then we have placed an additional
requirement on the representation: the representation must not contain too
many unconstrained features. Without such a representation, an observer
using the sampling strategy cannot recover the natural modes.


At  the  end  of  the  next  section,  after  defining  the  total  uncertainty  of  a 
categorization,  we  will  analyze  the  behavior  of  Uc  and  compare  its  properties 
to  the  distance  metrics  criticized  in  chapter  3. 

4.4  Total  Uncertainty  of  a  Categorization 

Having proposed measures for both Up and Uc, we must now introduce the
parameter $\lambda$, the relative weight of the two uncertainties, and construct
a measure for the total uncertainty $U(U_p, U_c, \lambda)$. We proceed by examining
how the two measures Up and Uc behave under both ideal and pure
noise conditions. By combining those results with some necessary or desirable
properties that the uncertainty measure should exhibit, we restrict how
$U(U_p, U_c, \lambda)$ may be constructed. We then show that a simple weighted sum
satisfies these constraints.


4.4.1  Ideal  categories 

Our  first  consideration  is  how  Up  and  Uc  behave  in  an  ideal  world  where 
there  are  purely  modal  classes  and  features.  By  purely  modal  we  mean 
that  for  each  class,  each  feature  takes  on  a  distinct  value.10  Therefore,  the 
features  are  completely  predictive:  knowledge  of  one  feature  is  sufficient  to 
correctly identify the class, allowing the prediction of all other features. One
possible taxonomy of such a world is shown in Figure 4.4. In this case there
are four classes of objects in the world: $\Omega_a$, $\Omega_b$, $\Omega_c$, $\Omega_d$. The taxonomy
is constructed such that at level 2, the categorization formed by the four
categories corresponds exactly to the 4 classes in the world. At that level, all
objects in a category have the same property vector, and across categories
each feature takes on a different value. Because the features $f_i$ used to
represent the objects are purely modal there are 4 values for each feature,
one value corresponding to each class.

10 If the current definition of a modal world appears awkward, it is because we have just
confronted Watanabe’s Ugly Duckling Theorem. Notice that we cannot define a purely
modal world without making reference to the features. Given the discussion of natural
modes in chapter 2 one would like to be able to say that some world is modal, independent
of the features used to describe it. Unfortunately, as demonstrated by Watanabe, this is
impossible without restricting the properties of the objects that may be used to describe
the objects. For example, suppose we arbitrarily partition the world into two groups, $C_a$
and $C_b$. Then, let us define a set of features $F$ such that for every $f_i \in F$, the objects
in category $C_a$ take on the value 1, and every object in category $C_b$ takes the value 0.
(A trivial example of such a feature is “1 if $\theta_i \in C_a$, 0 otherwise.”) Then, as described
by this set of features, the world would be purely modal. The only method by which we
can say there exist classes in the world is by restricting the properties of consideration
to be those that are of importance in the natural environment. We will return to this
point later when considering how the evaluation function proposed in this section (an
evaluation function derived from the goals of the observer) relates to the structure of
the natural world.

The  graphs  of  Figure  4.5  are  the  results  of  evaluating  Up  and  Uc  for 
each  of  the  different  categorizations  corresponding  to  a  different  level  in  the 
taxonomy.  In  these  and  subsequent  graphs,  the  abscissa  indicates  the  depth 
in  the  taxonomy.  A  depth  of  zero  corresponds  to  the  root  node,  where  the 
categorization contains only one category, THING. The depth of $d$ (where $d$
is the deepest level of the taxonomy and $d = \log n$) corresponds to the case
when each object is its own category. Notice that Up decreases (linearly)
from the root node to level 2. At the root node, all the objects are in one
category, and each feature can take on one of four values; therefore $U_p =
m \cdot \log 4 = 2m$, where $m$ is the number of features. At level 1, there are only
2 possible values for each feature in each category; thus $U_p = m \cdot \log 2 = m$.
Finally, at level 2, each feature is fixed to some value (in this perfectly modal
situation) and there is no uncertainty about an object’s properties once its
category is known. Therefore, at level 2 and all subsequent levels $U_p = 0$.

The  behavior  of  Uc  may  be  viewed  as  the  inverse  of  Up.  Uc  measures 
the  difficulty  in  identifying  an  object’s  category  given  some  of  its  properties. 
In  a  perfectly  modal  world  however,  if  no  two  categories  contain  objects 
belonging  to  the  same  real  class,  then  knowledge  of  any  property  of  an  object 
is  sufficient  information  to  recover  the  category.  This  can  be  seen  at  levels 
0, 1, and 2 in the graph of Uc. At level 0 all objects are in one category and
therefore there is no uncertainty as to an object’s category. At level 1, the
two categories do not contain any elements of a common class: $\Omega_a$ and $\Omega_b$
are in one category; $\Omega_c$ and $\Omega_d$, another. Thus knowledge of any property
of an object is still sufficient to recover its category. Similarly for level 2,
each class is in its own category and there is still no uncertainty about the
category. It is only when level 3 is reached, where the classes are split among
two categories, that any uncertainty arises. As the members of each class are
divided among more and more categories, Uc continues to increase (linearly).
At the finest categorization, at depth $d$, each object matches an object in
$n/4$ categories. Thus the maximum value for Uc is $\log(n/4) = d - 2$.
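The values quoted above can be reproduced with a small simulation of a purely modal world. The sketch assumes n = 32 objects, 4 classes, m = 3 features, and a balanced binary taxonomy; these sizes are illustrative. The category uncertainty is evaluated for a single observed feature rather than averaged over the whole power set, since the text's argument concerns knowledge of at least one property.

```python
import math
from collections import Counter

N_OBJECTS, N_CLASSES, M = 32, 4, 3


def entropy_of_counts(counts):
    """Entropy in bits of the distribution obtained by normalizing counts."""
    total = sum(counts)
    return sum(-(k / total) * math.log2(k / total) for k in counts if k > 0)


def up(Z):
    """Property uncertainty of a categorization (size-weighted, as in eq. 4.5)."""
    n = sum(len(c) for c in Z)
    weighted = 0.0
    for c in Z:
        per_feature = (Counter(v[i] for v in c).values() for i in range(M))
        weighted += len(c) * sum(entropy_of_counts(cnt) for cnt in per_feature)
    return weighted / n


def uc_one_feature(Z, feature=0):
    """Average category uncertainty when exactly one feature is observed."""
    n = sum(len(c) for c in Z)
    total = 0.0
    for c in Z:
        for theta in c:
            counts = [sum(o[feature] == theta[feature] for o in cat) for cat in Z]
            total += entropy_of_counts(counts)
    return total / n


# Purely modal world: every feature of an object equals its class index.
objects = [(k // (N_OBJECTS // N_CLASSES),) * M for k in range(N_OBJECTS)]
depth = int(math.log2(N_OBJECTS))  # d = log n = 5


def level(k):
    """Categorization at depth k: 2**k equal blocks of consecutive objects."""
    size = N_OBJECTS // 2 ** k
    return [objects[i:i + size] for i in range(0, N_OBJECTS, size)]


up_by_level = [up(level(k)) for k in range(depth + 1)]
uc_by_level = [uc_one_feature(level(k)) for k in range(depth + 1)]
```

Under these assumptions `up_by_level` falls from 2m = 6 to m = 3 to 0 and stays there, while `uc_by_level` is 0 through level 2 and then rises by one bit per level to d - 2 = 3.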

Before  proceeding  to  the  next  section,  we  should  note  that  by  defining  the 
“ideal”  category  case  we  have  implicitly  defined  natural  classes  to  be  those 
that  are  highly  redundant  and  non-overlapping  in  the  space  of  “important” 
properties.  We  will  return  to  this  point  when  we  consider  how  the  evaluation 
function  derived  in  this  section  relates  to  the  structure  of  the  natural  world. 
For  now,  we  note  that  the  discovery  of  categories  that  behave  similarly  to 
these  ideal  categories  would  permit  the  observer  to  accomplish  his  goals 
of  inference.  The  set  of  categories  corresponding  to  level  two  in  the  ideal 
taxonomy  supports  the  reliable  categorization  of  objects  as  well  as  strong 
inferences  about  the  properties  of  an  object  once  its  category  is  known.  The 
measure  we  construct  of  the  total  uncertainty  of  a  categorization  should  be 
sensitive  to  categories  of  this  form. 

4.4.2  Random  categories 

We  refer  to  a  set  of  categories  in  which  the  features  are  completely  inde¬ 
pendent  of  the  categories  as  a  random  categorization.  A  simple  way  to  con¬ 
sider  random  categorizations  is  to  construct  a  random  taxonomy,  where  the 
grouping  of  objects  into  a  hierarchy  is  achieved  arbitrarily  (Figure  4.6).  If 
we  evaluate  Up  and  Uc  at  the  different  levels  of  this  taxonomy,  we  would 
get the graphs of Figure 4.7. Up remains constant until the number of
categories becomes large and each category no longer contains a statistical
sample of the different classes of objects in the world. Similarly, Uc increases
monotonically,  though  the  rate  decreases  as  the  sampling  is  spread  too  thin. 
These  graphs  were  derived  experimentally  through  simulations. 

The  reason  it  is  important  to  consider  the  random  taxonomy  is  that  such 
a set of categorizations represents no structure in the data. The categorization
is useless for making any predictions. Recall that one of the major criticisms
of the standard cluster analysis paradigm is the inability to determine the
cluster  validity.  Even  if  there  are  no  clusters  present  in  the  data,  the  cluster 
analysis  programs  are  obligated  to  “discover”  categories.  By  requiring  that 
our  uncertainty  measure  have  a  certain  pathological  behavior  in  the  case 
where  there  is  no  structure  in  the  data,  we  will  provide  a  mechanism  by 
which we can determine when discovered categories are indeed valid. Note
that the particular form of random categorizations used here is only one (rather
restrictive) model of the absence of structure. In chapter 6 we will consider
an alternative form in which the observer has attempted to form the
best taxonomy possible in a world which has uniformly and independently
distributed features.

Figure 4.6: A random taxonomy. The taxonomy is created by creating a random
hierarchy of objects drawn from the four classes $\Omega_a$ through $\Omega_d$.

4.4.3 Defining $U(U_p, U_c, \lambda)$

To  construct  the  uncertainty  function  U ,  we  will  first  present  several  con¬ 
straints  that  U  must  satisfy.  Then,  we  will  propose  a  simple  measure  con¬ 
sistent  with  these  constraints.  All  of  the  constraints  are  expressed  in  terms 
of  evaluating  the  levels  of  a  taxonomy;  the  term  “preferred  categorization” 
refers  to  the  set  of  categories  selected  when  the  taxonomy  level  that  yields 
the  lowest  value  of  U  is  chosen.11  Two  of  the  constraints  will  be  based  upon 
the  behavior  of  Up  and  Uc  as  described  in  the  previous  section. 

The first two constraints describe the behavior of $U$ at the extreme values
of $\lambda$:

1. When $\lambda = 0$, the preferred categorization should be the finest, with each
object in its own category. This should be true for all possible taxonomies.

2. When $\lambda = 1$, the preferred categorization should be the coarsest, with
all objects in one category. Again, this should be true for all possible
taxonomies.

Another way of expressing the first constraint is that when $\lambda = 0$, the measure
$U$ should have no Uc terms, and the preferred categorization would be
that which minimizes Up. The second constraint would correspond to $U$ being
independent of Up when $\lambda = 1$. These constraints also combine to give $\lambda$
the intuitive meaning of being a relative weight between the two component
uncertainties.

The  next  constraint  expresses  the  desired  behavior  of  U  under  the  purely 
modal  conditions: 

3. In the purely modal taxonomy, the preferred categorization for $0 < \lambda < 1$
should be that which corresponds to a separate category for each class
of objects in the world.

For example, in the taxonomy of Figure 4.4, the preferred categorization
should be level 2, where each $\Omega_i$ is in its own category. This constraint
states that level 2 should be preferred for all $\lambda$ not at the extremes. If both
Up  and  Uc  have  a  non-zero  contribution  to  U,  then,  in  an  ideally  modal 
world,  the  best  categorization  is  that  which  selects  the  modal  classes. 
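Constraint 3 can be illustrated with a sketch that assumes a plain weighted sum, $(1 - \lambda) U_p + \lambda U_c$, without the normalization discussed later; this form is our assumption for illustration and is not the measure defined in the thesis. The per-level values are those the text derives for the ideal taxonomy, instantiated for an assumed m = 3 features and d = 5.

```python
def total_uncertainty(up, uc, lam):
    """Hypothetical unnormalized weighted sum of the two uncertainties."""
    return (1.0 - lam) * up + lam * uc


# Ideal-taxonomy values by level for m = 3 and d = 5 (levels 0..5):
# Up = 2m, m, 0, 0, ... and Uc = 0, 0, 0, 1, ..., d - 2.
UP_BY_LEVEL = [6.0, 3.0, 0.0, 0.0, 0.0, 0.0]
UC_BY_LEVEL = [0.0, 0.0, 0.0, 1.0, 2.0, 3.0]


def preferred_levels(lam):
    """Return every taxonomy level tied for the minimum total uncertainty."""
    values = [
        total_uncertainty(p, c, lam)
        for p, c in zip(UP_BY_LEVEL, UC_BY_LEVEL)
    ]
    best = min(values)
    return [k for k, v in enumerate(values) if v == best]


interior = {lam: preferred_levels(lam) for lam in (0.1, 0.5, 0.9)}
```

For every interior value of $\lambda$ tried, level 2 is the unique minimum, as constraint 3 requires. At the extremes this unnormalized sketch yields ties on the idealized data (levels 2 through 5 at $\lambda = 0$, levels 0 through 2 at $\lambda = 1$).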

We  also  wish  to  constrain  the  behavior  of  U  in  the  random  condition  of 
the  taxonomy  of  Figure  4.6.  Intuitively,  we  desire  that  the  behavior  of  U  in 
the  random  condition  be  predictable  so  that  we  can  determine  when  we  are 
evaluating  a  non-structured  set  of  categories.  We  will  impose  this  restriction 
in  the  following  way: 

4. In the random taxonomy (which contains no useful categories), the preferred
categorization should be either the finest or the coarsest, depending
on $\lambda$.

That is, for each $\lambda$, the value of $U$ should be a minimum at one of the
extreme levels of categorization. Unfortunately, this constraint cannot be
fully  discussed  until  we  present  the  concept  of  lambda-space  in  the  next  sec¬ 
tion.  At  that  time,  we  will  provide  the  intuition  behind  this  constraint.  For 
now  we  only  state  that  a  random  taxonomy  contains  no  useful  intermediate 
structure  and  thus  no  intermediate  level  should  be  preferred. 

Finally  we  include  an  constraint  which  allows  us  to  compare  one  catego¬ 
rization  to  the  next  in  a  meaningful  manner  and  which  allows  us  to  interpret 
A  as  a  relative  weight  between  Up  and  Uc' 

5.  U  should  be  normalized  with  respect  to  the  number  of  objects  contained 

in  the  categorization. 

If  we  were  strictly  adhering  to  the  definitions  provided  at  the  beginning  of 
this  chapter  we  would  not  need  to  be  concerned  with  normalization:  every 
categorization  is  a  partitioning  of  a  fixed  population.  However,  in  the  next 
chapter  we  will  utilize  the  measure  developed  here  in  a  dynamic,  incremental 
categorization method. Thus we need to be able to normalize for the number
of objects contained in a categorization. Also, to interpret λ as the relative
weight between Up and Uc we must make their scales commensurate.

Combined, these constraints restrict the functional form of U; we shall
propose a simple measure for U which satisfies these five constraints.
Afterwards, we will compare this measure with some of the category metrics
discussed in chapter 3.

We first need to introduce a normalization coefficient which will make
the measure independent of the number of objects in a categorization. Note
that given “enough” objects per category, Up is independent of the number
of objects in a categorization, since it depends only on the entropy of the
properties. Uc, however, may depend critically on the number of objects:
given more objects we can create more categories and make the number
of possible category matches of an object arbitrarily large. Therefore
we need to scale Uc appropriately for the number of objects. Also, though
by design both Up and Uc are unitless (or, as is sometimes said, in units of
information referred to as bits), they do not have the same range. The
maximum value for Up is unrelated to the maximum value for Uc. Therefore, to
make them commensurate we will scale the normalized Uc by the maximum
Up.

We  compute  the  normalization  coefficient  as  follows:  Suppose  we  are 
given  some  categorization  Z  to  evaluate.  Let  us  construct  two  new  catego¬ 
rizations  from  Z.  Define  Coarsest(Z)  to  be  the  categorization  formed  by 
placing  all  the  objects  of  Z  in  one  category.  Analogously,  define  Finest(Z) 
to  be  the  categorization  formed  by  placing  all  the  objects  in  Z  into  separate 
categories. We define a normalization factor η(Z):

\[ \eta(Z) = \frac{U_p(\mathrm{Coarsest}(Z))}{U_c(\mathrm{Finest}(Z))} \tag{4.10} \]

By  dividing  by  Uc(  Finest(Z) )  we  compensate  for  the  number  of  objects; 
the  numerator  makes  the  scale  the  same  as  that  of  Up. 

Finally,  we  can  propose  our  measure  for  the  total  uncertainty  of  a  cate¬ 
gorization: 

\[ U(Z) = (1 - \lambda)\, U_p(Z) + \lambda\, \eta(Z)\, U_c(Z) \tag{4.11} \]

The total uncertainty of a categorization is simply the weighted sum of Up
and Uc, where Uc has been scaled to be of the same range as Up; the
parameter λ controls the relative weights.
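
As a minimal sketch of equations 4.10 and 4.11, the combination can be written
as a short function. It assumes that the values of Up and Uc have already been
computed from the definitions given earlier in the chapter; the function and
argument names are ours and are chosen only for illustration.

```python
def normalization_factor(up_coarsest: float, uc_finest: float) -> float:
    """eta(Z) of equation 4.10: Up(Coarsest(Z)) / Uc(Finest(Z))."""
    if uc_finest <= 0:
        raise ValueError("Uc(Finest(Z)) must be positive to normalize")
    return up_coarsest / uc_finest


def total_uncertainty(lam: float, up: float, uc: float,
                      up_coarsest: float, uc_finest: float) -> float:
    """U(Z) of equation 4.11 for a relative weight lam in [0, 1].

    up, uc       -- Up(Z) and Uc(Z) for the categorization being evaluated
    up_coarsest  -- Up of the one-category partition of the same objects
    uc_finest    -- Uc of the one-object-per-category partition
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    eta = normalization_factor(up_coarsest, uc_finest)
    return (1.0 - lam) * up + lam * eta * uc
```

At lam = 0 the function returns Up alone and at lam = 1 it returns the
normalized Uc alone, which is the behavior required by constraints 1 and 2.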

It is easily shown that equation 4.11 satisfies the behavioral constraints
1-4. Constraint 1 holds because when λ = 0, U(Z) = Up(Z), and Up is at
a minimum (in fact zero) at the finest categorization, where each object is
in its own category. Constraint 2 follows analogously. Constraint 3 is also
satisfied: if 0 < λ < 1, then U is at a minimum (zero) when both Up and
Uc are zero. As shown in Figure 4.5, this occurs only at the desired natural
class  categorization.  In  fact,  since  zero  is  the  absolute  minimum  for  U,  in 
an  ideally  modal  world,  the  categorization  which  corresponds  to  the  natural 
classes  is  the  best  possible  categorization,  not  simply  the  best  level  in  some 
taxonomy. 

Constraint 4 holds because of the concavity of both Up and Uc in the
random condition. Because a non-negatively weighted sum of two concave
functions is also concave (Luenberger, 1986), U is guaranteed to be concave in
the random case, and a concave function attains its minimum over an interval
at one of the endpoints. Therefore, for all λ, the minimum of U is at one of
the extreme levels of the taxonomy.
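
The concavity argument can be checked numerically. The sketch below uses two
hypothetical concave curves over a nine-level taxonomy (they are stand-ins, not
the measured curves of the random taxonomy) and confirms that the minimizing
level is always one of the two extremes. Uc is taken to be already normalized.

```python
import math

DEPTH = 8  # levels 0 (coarsest) through 8 (finest)
levels = range(DEPTH + 1)

# Hypothetical concave curves, for illustration only.
# Up falls from 1 at level 0 to 0 at the finest level;
# normalized Uc rises from 0 at level 0 to 1 at the finest level.
up_curve = [1.0 - (d / DEPTH) ** 2 for d in levels]
uc_curve = [math.sqrt(d / DEPTH) for d in levels]


def preferred_level(lam, up_by_level, uc_by_level):
    """Index of the level minimizing (1 - lam) * Up + lam * Uc."""
    totals = [(1.0 - lam) * up + lam * uc
              for up, uc in zip(up_by_level, uc_by_level)]
    return min(range(len(totals)), key=totals.__getitem__)


lambdas = [i / 20 for i in range(21)]
chosen = {lam: preferred_level(lam, up_curve, uc_curve) for lam in lambdas}
```

Every entry of `chosen` is either 0 or `DEPTH`; no intermediate level is ever
selected, whatever the value of lam.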

Of  course,  one  could  construct  a  more  complicated  measure;  given  that 
the  proposed  measure  satisfies  the  imposed  constraints  and  that  it  is  similar 
to  standard  functionals  for  combining  constraints,  there  is  no  apparent  reason 
to  do  so.  In  the  next  section  we  will  compare  the  properties  of  U(Z)  with 
some  of  the  distance  metrics  discussed  in  chapter  3. 

4.4.4  Uncertainty  as  a  metric 

Having  provided  a  formal  definition  for  the  uncertainty  of  a  categorization, 
we  can  now  compare  this  function  to  the  distance  metrics  discussed  in  chapter 
3.  Specifically,  we  should  address  the  criticisms  raised  concerning  the  use  of 
distance  metrics  to  define  object  categories. 

First,  notice  that  although  we  do  not  explicitly  define  the  distance  be¬ 
tween  two  objects,  the  property  uncertainty  functions  do  provide  an  implicit 
measure  for  comparing  two  objects.  In  particular,  the  entropy  function  im¬ 
poses  a  Hamming-like  distance  between  objects  since  the  entropy  measures 
are  sensitive  to  the  exact  matches  between  feature  values.  As  mentioned 
in  chapter  3  such  measures  are  sensitive  to  the  resolution  of  the  features. 
For example, if a feature is continuously valued (e.g. “length”) and is
histogrammed into fine-grained buckets, then all objects will take on a different
value.  In  this  case  the  feature  will  be  able  to  convey  no  information  about 
classes  in  the  data.  Thus,  using  these  entropy  measures  requires  that  some 
of  the  features  of  the  representation  be  suitably  chosen  to  convey  the  dis¬ 
tinctions  between  different  classes  of  objects. 
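
The effect of feature resolution can be illustrated with a small sketch. It
uses a plain Shannon entropy over binned values as a stand-in for the property
uncertainty (Up itself is defined earlier in the chapter), and the leaf lengths
and bin widths are hypothetical.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def binned(values, width):
    """Histogram continuous values into buckets of the given width."""
    return [math.floor(v / width) for v in values]


# Hypothetical lengths (cm) for two classes of leaves.
short_leaves = [2.13, 2.41, 2.78, 2.96]
long_leaves = [7.09, 7.35, 7.62, 7.88]
all_leaves = short_leaves + long_leaves

# Coarse buckets: the class fixes the bucket, so knowing the class
# removes all uncertainty about the binned length.
coarse_pooled = entropy_bits(binned(all_leaves, 5.0))
coarse_within = entropy_bits(binned(short_leaves, 5.0))

# Fine buckets: every leaf falls in its own bucket, so the within-class
# entropy is as large as it can be for four objects.
fine_pooled = entropy_bits(binned(all_leaves, 0.1))
fine_within = entropy_bits(binned(short_leaves, 0.1))
```

With the coarse buckets the entropy drops from one bit to zero once the class
is known. With the fine buckets the within-class entropy equals log2 of the
class size, exactly what any arbitrary group of four objects would give, so the
feature says nothing about class membership.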

For two reasons, the above restriction does not significantly reduce the
utility of the categorization evaluation function. First, it is necessarily true
that the observer must encode relevant sensory information if he is to discover
natural classes of objects. If the only properties of objects used for
categorization  were  those  completely  independent  of  the  “type”  of  object, 
then  no  interesting  properties  could  be  predicted  from  those  observations.  If 
we  invoke  the  power  of  evolution  in  the  design  of  the  observer,  then  we  expect 
that the observer would be provided with a set of features sufficient to determine
the true class of the object, thereby granting him the ability to form
an appropriate set of categories. In chapter 6 we will demonstrate how the
observer could use an initial set of useful features to evaluate the predictive
power of a new feature. But a sufficient initial set must be provided.

Second,  the  nature  of  the  entropy  measures  is  that  not  all  of  the  features 
need be constrained. Unlike standard distance metrics, where long inter-object
distances in a few dimensions can mask clustering in other dimensions,
entropy measures are statistical in nature and can detect structure in
separate dimensions. Up was designed to treat the features independently
and  Uc  considers  object  matches  along  different  subsets  of  features.  We  can 
demonstrate  this  behavior  by  reconsidering  the  ideal  taxonomy  of  Figure  4.4. 
Let  us  assume  we  have  the  same  taxonomy  of  objects  and  the  same  modal 
features. However, this time we shall include several noise features: features
whose values are independent of the object class. Figure 4.8 displays
the results of evaluating Up and Uc for the different levels of a four-class
taxonomy. Notice that both uncertainties are no longer zero at the modal level;
the increase in uncertainty is caused by the noise features. However, there
is still a significant change in the behavior of both Up and Uc at level two.
As the number of noise features is increased the change in the slope of
the curves diminishes; when there are many more noise features than modal
features the graphs approach those of the random taxonomy in Figure 4.7.

Figure 4.8: The graphs of Up and Uc for an ideal taxonomy of four classes, but
with noise features added. At level 2, the categories of the taxonomy correspond
exactly to the real classes in this modal world. With the addition of noise, the
curves are no longer the simple linear functions of the pure modal case, but there
is still a definite break at the natural level.

The ability to still detect structure in the presence of unconstrained features
also allows the entropy measure to be used when different features are
constrained for different classes of objects. Therefore, there is no requirement
to use only features constrained in all classes. Metrics based on the
within-class and between-class scatter matrices are unreliable in the presence
of unconstrained dimensions. Furthermore, the need to modify the distance
metric as one moves from one region of feature space to another becomes
less pressing since the measure can respond to one set of properties for one
class, and another for a different class. Thus using entropy functions eliminates
several of the difficulties associated with using distance metrics for
categorization.

We should note that the total uncertainty evaluation function derived in
this section is analogous to the modified measure of collocation discussed in
chapter 3. Had we combined the measures Up and Uc via an exponentiated
product, Up^(1-λ) Uc^λ, we would have produced a measure directly related to
the extended collocation function (see footnote 12):

\[ K'_{i,j} = P(C_j \mid f_i)^{\lambda} \cdot P(f_i \mid C_j)^{1-\lambda} \]

Thus,  we  have  incorporated  the  lesson  of  basic  level  categories  in  our  measure 
of  uncertainty:  a  measure  designed  to  evaluate  basic  categories  must  consider 
both  the  cue  validity  of  the  features  and  the  internal  similarity  of  categories. 

Finally,  we  note  that  the  category  evaluation  function  derived  in  the 
previous  section  explicitly  measures  how  well  the  observer  can  accomplish 
his  goals  of  inference.  Recall  that  one  of  the  criticisms  of  the  cluster  analysis 
paradigm  was  that  it  made  no  sense  to  consider  the  utility  of  the  recovered 
classes.  By  definition  the  classes  recovered  were  those  which  minimized  the 
particular  evaluation  function.  Whether  these  categories  were  appropriate 
for  some  task  depended  on  how  well  the  requirements  of  the  task  mapped 
onto the clustering criteria used. In our case, we have constructed a criterion
that directly measures the utility of a categorization for the task of making
inferences about objects. If one believes that the goal of object recognition is
to make inferences about objects, then the set of categories selected by the
categorization criterion U(Z) is appropriate for recognition.

4.5  Natural  Categories 

The evaluation of a categorization proposed in the previous section is based
upon the goals of the observer; a categorization which has a “low” measure
of uncertainty U should permit the observer to perform his necessary inference
tasks successfully. Furthermore, we have constructed the measure such
that in a suitably defined ideal world, the measure prefers a modal categorization
over any other. However, we have yet to mention the world of
objects which the observer is going to categorize. In section 2 we introduced
the Principle of Natural Modes as the basis for the categorization process.
The claim was made that the reason it is plausible that the observer could
predict unobserved properties from observed properties was that there were
constraints acting in the world which caused redundancies between observed
and unobserved properties to be present. How do we incorporate the idea of
natural classes into our evaluation of categorizations?

12. Since both Up and Uc can be zero, this particular expression would be
ill-defined unless other parameters or constants are added.

4.5.1  Natural  classes  and  natural  properties 

The first question to be considered is how the proposed evaluation
function behaves if the categorization being measured does reflect the natural
modes present in the world. Recall that U was constructed such that in an
ideally modal world, the categorization corresponding to the natural classes
would be preferred. In the modal world, each feature was completely
diagnostic.  Whether  the  proposed  evaluation  function  will  be  able  to  capture 
the  structure  present  in  the  natural  world  will  depend  upon  the  diagnosticity 
of  the  chosen  representation.  That  is,  the  representation  must  be  chosen  such 
that  the  constraints  imposed  by  the  natural  object  processes  are  reflected  in 
the  properties  measured. 

An  example  will  help  to  illustrate  this  point.  Consider  the  case  where  we 
have  some  class  of  objects  where  the  aspect  ratio  (ratio  between  the  length 
and  the  width)  is  fixed  by  the  process  which  generates  that  class.  Suppose 
that both “length” and “width” are features measured about the object, but
that aspect ratio is not. Our measure of total uncertainty would not be
sensitive to this constraint being present in the class. If, however, aspect
ratio were a feature, then this constraint would be reflected in the measure
of both property and category uncertainty; categorizing that particular class
of  objects  separately  would  reduce  the  uncertainty  measure.  Thus  we  rely 
on  the  choice  of  features  (which  define  those  properties  which  are  observed) 
being  appropriate  for  measuring  the  constraint  that  is  found  within  classes. 

Note that although we require that the representation be sufficiently
restricted so that the differences between classes are made explicit, we do
not require that irrelevant information be prohibited from the representation.
Recall the graphs of Figure 4.8. These results demonstrate that the
uncertainty  measure  can  still  detect  the  ideal  modal  structure  even  when  a 
significant  portion  of  the  property  description  is  generated  randomly.  When 
we  test  the  evaluation  function  on  real  data  in  the  next  section,  we  will 
discover  that  real  property  descriptions  behave  in  a  manner  similar  to  those 
produced  with  modal  and  noise  features. 

4.5.2  Lambda  stability  of  natural  categories 

One might consider a sufficiently low measure of total uncertainty U to be
a simple indicator of a correct natural mode categorization. Unfortunately,
this requires having some absolute metric of uncertainty for categorizations.
For example, consider again the taxonomy of Figure 4.1, and recall that we
are considering the categorizations which correspond to the different levels
in the tree. Let us assume that for some λ = λ0 the third level yielded the
least total uncertainty U. How can we know whether this level reflects modal
structure in the objects of the taxonomy, or if this is simply some arbitrary
categorization which just happens to evaluate to the lowest uncertainty for
the given λ? This question is analogous to the question of cluster validity
raised in chapter 3.

Let us assume that we have been given a taxonomy such as Figure 4.1
and that for some discretized range of λ, 0 ≤ λ ≤ 1, we have selected the
categorization corresponding to the level in the taxonomy which minimizes
the total uncertainty. We can plot the results of this procedure in a
lambda-space diagram as illustrated in Figure 4.9. Notice that for λ = 0 the best
categorization is that which places all the objects in their own category.
Likewise, λ = 1 selects the top level, where the only category is THING.
The question we must consider is how the selected level changes as λ
varies. By design, we know that in the ideally modal case the categorization
corresponding to the modal classes will be preferred for all λ, 0 < λ < 1.
But what about “real” natural classes?
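
The selection procedure just described can be sketched as follows. The two
curves are hypothetical values chosen to show a break at level 2 of a
nine-level taxonomy; they are not measured data, and Uc is assumed to be
already normalized by the factor of equation 4.10. All names are ours.

```python
from collections import Counter

# Hypothetical per-level values, indexed by level 0 (coarsest) to 8 (finest).
UP = [1.00, 0.55, 0.10, 0.09, 0.08, 0.07, 0.06, 0.04, 0.00]
UC = [0.00, 0.05, 0.10, 0.25, 0.40, 0.55, 0.70, 0.85, 1.00]


def lambda_space(up_by_level, uc_by_level, steps=20):
    """For each lam on a discrete grid over [0, 1], return (lam, best level)."""
    diagram = []
    for i in range(steps + 1):
        lam = i / steps
        totals = [(1.0 - lam) * up + lam * uc
                  for up, uc in zip(up_by_level, uc_by_level)]
        best = min(range(len(totals)), key=totals.__getitem__)
        diagram.append((lam, best))
    return diagram


diagram = lambda_space(UP, UC)

# How many grid points select each level: a wide run at one intermediate
# level is the signature of stability with respect to lam.
support = Counter(level for _, level in diagram)
```

For these curves the finest level is selected at lam = 0, the coarsest at
lam = 1, and level 2 is selected over the whole middle of the range.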

Figure 4.9: Stable λ-space diagram. For each λ in some discrete range between 0
and 1, the level of the taxonomy is selected which minimizes the total uncertainty.

In Figure 4.9 the hypothetical behavior is that over some (wide) range of
λ, the preferred categorization remains the same. We refer to this behavior
as λ-stability. The occurrence of λ-stability indicates that the categories
selected for that range of λ are robust with respect to deviations in the relative
weight between property uncertainty and category uncertainty. Therefore,
they represent an actual structuring of the objects, not an arbitrary
minimization of the uncertainty function. If the categorization were just an
arbitrary minimum, we would expect that the preferred level would change
as variations in λ drive the minimum solution toward either of the two
extremes. In chapter 6 we will consider the question of λ-stability in greater
detail and will consider how one can use the measure of stability as a tool to
accomplish other important tasks related to the categorization process, e.g.
evaluating the utility of a new feature.

Another possible behavior as λ varies is illustrated in Figure 4.10. In this
case, the only stable points are the two extreme categorizations. Recall that
the fourth constraint on the construction of the total uncertainty function U
was that if there was no internal structure in a taxonomy, then the
preferred categorization should be one of the two extremes.
The intuition behind this constraint is the following. Consider a taxonomy
which is created randomly. In terms of the uncertainties measured,
each level of aggregation then represents the same trade-off between category
homogeneity and category overlap, between property uncertainty and category
uncertainty. Now let us describe λ as a pressure to move up the taxonomy.
The larger λ gets, the easier it is to trade the gain in property uncertainty
for the reduction in category uncertainty which occurs when categories are
merged. When λ starts at 0, the preferred categorization is the finest
partition, with each object in its own category. As we initially increase λ the
preferred level of categorization does not change because there is insufficient
pressure to overcome the increase in property uncertainty which occurs by
randomly combining objects. Eventually, however, λ is great enough that
the first level of merging takes place. But, as stated, in a random taxonomy
each level of merging is the same amount of trade-off between property
uncertainty and category uncertainty. Therefore once λ is great enough to
prefer level d - 1 over level d, it is great enough to prefer level d - 2 over
d - 1, continuing until level 0 is reached. Therefore, at some critical λ the
preferred level of categorization moves immediately from the lowest level to
the highest level.
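
The argument above can be made concrete under a deliberately idealized
assumption, introduced here only for illustration: suppose that in a random
taxonomy of depth D every merge changes each component by the same amount, so
that both are linear in the level d (with d = 0 the coarsest level and d = D
the finest). The constants a and b are hypothetical.

\[
U_p(d) = a\,(D - d), \qquad \eta\,U_c(d) = b\,d, \qquad a, b > 0,\quad d = 0, 1, \dots, D .
\]

Substituting into equation 4.11 gives

\[
U(d) = (1-\lambda)\,a\,(D - d) + \lambda\,b\,d
     = (1-\lambda)\,a\,D + d\,\bigl[\lambda b - (1-\lambda)\,a\bigr] ,
\]

which is linear in d. Its slope is negative for \(\lambda < a/(a+b)\), so the
minimum is at the finest level d = D, and positive for \(\lambda > a/(a+b)\),
so the minimum is at the coarsest level d = 0. No intermediate level is ever
preferred, and the jump occurs at the single critical value

\[
\lambda^{*} = \frac{a}{a+b} .
\]

Under the normalization of equation 4.10, \(\eta\,U_c(\mathrm{Finest}) =
U_p(\mathrm{Coarsest})\), that is \(bD = aD\), so in this idealized case
\(\lambda^{*} = 1/2\).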

Figure 4.10: Degenerate λ-space diagram. The only preferred levels of
categorization are the two extremes, indicating no coherent internal structure of
the taxonomy.

As mentioned earlier, the noise taxonomy is only one possible null hypothesis
about the absence of structure in a set of categories. In chapter 6
we will again return to the question of category validity, and compare the
results of ideal (purely modal) worlds, real worlds, and a different case of a
noise world. At this point we have a certain degree of confidence that the
total uncertainty evaluation function will be useful for measuring how well a
categorization reflects the natural mode classes. Let us now test the function
on some real data taken from the domain of leaves.

4.6  Testing  the  measure 

To test the behavior of the uncertainty measure U of equation 4.11 we need
a sample domain of objects which satisfies the criterion of having well-defined
natural classes. Of course, this criterion implies a previously agreed upon
method  of  categorization  which  produces  natural  categories.  Therefore  we 
must  make  the  a  priori  assumption  that  the  science  which  studies  the  domain 
establishes  a  baseline  to  which  we  compare  our  evaluation.  The  validity  of 
this  assumption  depends  upon  how  well  the  science  understands  the  processes 
which  determine  the  structure  of  the  objects  in  the  domain. 

4.6.1  Properties  of  leaves 

The  sample  domain  used  is  that  of  leaves.  The  categorization  of  trees  ac¬ 
cording  to  their  leaves  is  a  well  developed  discipline,  and  there  exist  agreed 
upon  categories.  The  source  of  the  leaf  data  is  Preston’s  North  American 
Trees  [Preston,  1976]. 

In  order  to  apply  the  uncertainty  measure  to  our  domain,  we  must  create 
a  property  based  description  of  the  leaves.  But  which  properties  should  be 
used?  Are  arbitrary  features  permissible,  or  should  our  choices  be  somehow 
restricted?  To  proceed  we  must  delineate  some  criteria  by  which  to  choose 
our  feature  set. 

The first restriction we will impose is (well-defined) computability. By
this restriction we mean that if some property is going to be included in
the representation for leaves, then one must be able to provide a plausible
method for computing this property directly from the physical structure
of the leaf. The reason that this restriction is important is that otherwise the
property “oak-ness” (how much the leaf looks like an oak leaf) would be
an acceptable property. If such features are permitted, then categorization
reduces to providing features which are characteristic functions for each
category. As Fodor has commented: “if being-my-grandmother is a legitimate
feature then it’s pretty clear how to do object recognition.” [Personal
communication.] In our leaf example we will further restrict our properties
by  requiring  them  to  be  computable  from  information  recoverable  from  an 
image  of  the  leaf,  precluding  features  such  as  stickiness  or  scent.  Our  reason 
for  doing  so  is  simply  that  visual  categorization  is  of  primary  interest. 

We  note  here  that  in  the  following  examples,  we  did  not  actually  pro¬ 
vide  the  system  with  a  sensory  input  (e.g.  images).  Rather,  after  deciding 
which  features  were  to  be  used,  property  vectors  were  given  directly.  The 
motivation  for  eliminating  the  property  computation  step  is  that  we  are  not 
interested  in  how  well  we  can  provide  algorithms  capable  of  measuring  the 
properties.  Our  interest  lies  in  seeing  how  well  these  properties  can  be  used 
to  measure  structure  in  categories. 

Having restricted our properties to being well-defined computations, we
still have to choose which of these properties should be used. For our first
examples we will use the features normally used by botanists to
classify leaves. By using this set we are guaranteed to have a set of features
which contains sufficient information to distinguish between the classes
of leaves. Of course, these features tend to be highly diagnostic as they are
used by botanists for the express purpose of classification; however, some of
the features overlap considerably across species (e.g. “length”). Also, we will
consider the case of adding some noise features to the descriptions: features
whose values are independent of the type of leaf. Those results will demonstrate
a graceful degradation in the ability of the uncertainty measure to
detect the correct categories.

Table  4.1  is  a  list  of  features  used  to  describe  the  leaves,  the  values  in 
the  range  of  each  feature,  and  a  brief  description  of  how  they  would  be 
computed from an image. One of the features normally used by botanists to
describe leaf categories is “shape,” where several distinct shape types are
used as primitives. Since this feature bordered on not being a well-defined
computation, it was replaced with the three features of width, length, and
flare, where flare is the direction and degree of tapering of the leaf.

Feature   Values                                        Method of Computation
Length    {1, 2, 3, ..., 8, 9}                          Measure directly
Width     {1, 2, 3, ..., 6, 7}                          Measure directly
Flare     {-2, -1, 0, 1, 2}                             Fit best ovoid
Lobes     {1, 2, 3, ..., 8, 9}                          Filter and count
Margin    {Entire, Crenate, Serrate, Doubly Serrate}    Fractal dimension of edge
Apex      {Rounded, Acute, Acuminate}                   Curvature of tip
Base      {Rounded, Cuneate, Truncate}                  Curvature of base
Color     {Light, Dark, Yellow}                         Measure green component of color

Table 4.1: Leaf features and values.

Using these features, we can describe several leaf specifications. A
specification is a set of values for each feature which would be consistent with the
description of a leaf species found in Preston [1976]. Table 4.2 provides the
set of specifications used for the examples here. Several points should
be made about the features. First, there are features which are not highly
constrained by the specifications and which have high inter-category overlap,
e.g. length and width. Second, the specifications are disjoint; no leaf could be
constructed which satisfies more than one specification. Finally, there is no
small subset of features (fewer than 4) which would be sufficient to distinguish
between the species.

A  “leaf  generator”  has  been  constructed  which  takes  as  input  a  leaf  spec¬ 
ification  and  produces  a  property  vector  consistent  with  the  specification. 
These  property  vectors  are  the  “objects”  used  for  all  of  the  experiments. 
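
A generator of this kind might be sketched as follows. The feature ranges are
those of Table 4.1; the example specification is hypothetical and is not a row
of Table 4.2, and uniform sampling of the permitted values is our assumption,
since the sampling scheme is not stated.

```python
import random

# Permitted values of each feature, following Table 4.1.
FEATURE_VALUES = {
    "Length": list(range(1, 10)),
    "Width": list(range(1, 8)),
    "Flare": list(range(-2, 3)),
    "Lobes": list(range(1, 10)),
    "Margin": ["Entire", "Crenate", "Serrate", "Doubly Serrate"],
    "Apex": ["Rounded", "Acute", "Acuminate"],
    "Base": ["Rounded", "Cuneate", "Truncate"],
    "Color": ["Light", "Dark", "Yellow"],
}

# Hypothetical specification, for illustration only.
EXAMPLE_SPEC = {
    "Length": {3, 4, 5, 6},
    "Width": {3, 4, 5},
    "Flare": {0},
    "Lobes": {3, 5},
    "Margin": {"Entire"},
    "Apex": {"Acute"},
    "Base": {"Truncate"},
    "Color": {"Light"},
}


def generate_leaf(spec, rng):
    """Return one property vector whose values are consistent with spec.

    A feature missing from spec is treated as unconstrained and may take
    any of its permitted values.
    """
    leaf = {}
    for feature, permitted in FEATURE_VALUES.items():
        allowed = set(spec.get(feature, permitted))
        if not allowed <= set(permitted):
            raise ValueError(f"specification for {feature} leaves its range")
        # Sort so that the draw is reproducible for a seeded generator.
        leaf[feature] = rng.choice(sorted(allowed, key=str))
    return leaf


rng = random.Random(0)
sample_leaves = [generate_leaf(EXAMPLE_SPEC, rng) for _ in range(64)]
```

Each generated dictionary plays the role of one “object”; a noise feature
could be modelled in the same way by leaving it out of every specification.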

4.6.2  Evaluation  of  taxonomies 

To  begin  our  testing  of  the  behavior  of  the  uncertainty  measure  U,  let  us 
consider  the  task  of  evaluating  levels  of  a  taxonomy.  Although  this  presup¬ 
poses  being  provided  a  taxonomy  to  evaluate,  we  will  be  able  to  check  that 
the  measure  U  behaves  as  predicted.  We  will  be  able  to  compare  the  situa¬ 
tion  in  which  there  is  structure  in  the  taxonomy  to  that  when  a  taxonomy  is 
created  randomly.  Later,  we  will  address  the  problem  of  discovering  natural 
categories  in  an  object  population. 

Figure 4.11: A jumbled taxonomy of Oak, Maple, Poplar, and Birch leaves.
Leaves are randomly combined to form higher categories.

Figure 4.12: An ordered taxonomy of Oak, Maple, Poplar, and Birch leaves.

Maple    Poplar

Length             Width           Flare
{3, 4, 5, 6}       {3, 4, 5}       0
{1, 2, 3}          {1, 2}          {0, 1}
{5, 6, 7, 8, 9}    {2, 3, 4, 5}    0
{2, 3, 4, 5}       {1, 2, 3}       0
{3, 4, 5, 6}       {2, 3, 4, 5}    2

Margin                Apex         Base        Color
Entire                Acute        Truncate    Light
{Crenate, Serrate}    Acute        Rounded     Yellow
Entire                Rounded      Cuneate     Light
Doubly Serrate        Acute        Rounded     Dark
Crenate               Acuminate    Truncate    {Light, Dark, Yellow}
Doubly Serrate        Acuminate    Rounded     Dark

{4, 5, 6}

Table 4.2: Leaf specifications for several species of leaves. A leaf generator was
designed which created property vectors consistent with the different specifications.

Figure 4.11 is an example taxonomy. At the bottom of the taxonomy
are the leaves; in this case there are the four types: Oak, Maple, Poplar,
and Birch. These single-object categories are combined to form the next
level of categories, continuing on until all the leaves are in one category.
The letters written in each node indicate the types of leaves contained in
that node. The bottom nodes (the leaves of the tree, if you will) represent
single instances of leaves. Figure 4.11 is a random taxonomy, where the
leaves were arbitrarily combined to form higher categories. There is a total
of 9 levels (0-8), indicating 256 leaves, 64 of each type. For comparison,
Figure 4.12 is an ordered taxonomy of the same leaves in which the nodes
have been constructed so as to preserve the natural classes of the species.
The first question we will consider is how the two components Up and Uc of
the total uncertainty measure behave as we evaluate different levels of these
two taxonomies.
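
The two taxonomies can be sketched as binary trees built by repeatedly merging
adjacent pairs of categories. In this sketch a species label stands in for each
leaf's property vector, the pairwise merging is an assumption consistent with
9 levels for 256 leaves, and the order in which species are paired above level
2 is arbitrary.

```python
import random

SPECIES = ["Oak", "Maple", "Poplar", "Birch"]


def build_levels(leaves):
    """Return levels[d], the list of categories at depth d.

    Depth 0 holds a single category containing every leaf; the deepest
    level holds one category per leaf. Categories at each depth are formed
    by merging adjacent pairs from the next finer depth.
    """
    n = len(leaves)
    if n == 0 or n & (n - 1):
        raise ValueError("the number of leaves must be a power of two")
    levels = [[[leaf] for leaf in leaves]]
    while len(levels[-1]) > 1:
        finer = levels[-1]
        levels.append([finer[i] + finer[i + 1]
                       for i in range(0, len(finer), 2)])
    levels.reverse()
    return levels


# Ordered taxonomy: leaves of the same species are adjacent, so merging
# keeps each species together up to level 2.
ordered_leaves = [s for s in SPECIES for _ in range(64)]
ordered = build_levels(ordered_leaves)

# Jumbled taxonomy: the same leaves in a random order.
jumbled_leaves = ordered_leaves[:]
random.Random(0).shuffle(jumbled_leaves)
jumbled = build_levels(jumbled_leaves)
```

In `ordered`, level 2 consists of four categories of 64 leaves, one per
species; in `jumbled`, the categories at level 2 mix the species.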

4.6.3  Components  of  uncertainty 

In the graph of Figure 4.13 the quantities Up and normalized Uc are plotted
as a function of depth in the jumbled taxonomy. A depth of zero corresponds to
the top level of the taxonomy with only one category; a depth of 8 (because
there were 256 leaves in this example) is the finest categorization. Both
curves are monotonic in depth, as predicted when the quantities were derived.
Notice that both curves vary smoothly, indicating no special level in
the taxonomy. Because the taxonomy was created by randomly combining
leaves, no level contains any more structure than any other level.

Figure 4.13: The evaluation curves for the jumbled taxonomy. Plotted are Up
and the normalized Uc as a function of depth. The normalization factor causes the
scales of the two graphs to be the same. Both curves change smoothly, indicating
no special level within the taxonomy.

Now let us consider the taxonomy in Figure 4.12. In this case the taxonomy
segregates the different types of leaves at level 2, with the finer divisions
below that level being made randomly. The evaluation curves for this taxonomy
are plotted in Figure 4.14. Now the curves no longer vary smoothly, but
have a distinct break at the second level, where the different types of leaves
are segregated into different categories. Let us trace each of the curves. As
predicted, the property uncertainty starts at a maximum at level 0. Splitting
into two categories, each containing two types of leaves, significantly
reduces the property uncertainty, since knowing which of the two categories
a leaf comes from restricts its properties to being of one of two types of
leaves instead of four. The next split into four categories (at level 2) causes
a similar decrease in property uncertainty. However, after level two, there is
no significant decrease in property uncertainty because a category which has
32 leaves of one type has not much less property uncertainty than a category
which has 64 leaves of that type. The property uncertainty remains almost
constant until end effects occur and there are few leaves per category.

Figure 4.14: The evaluation curves for the ordered taxonomy. Plotted are Up and the normalized Uc as a function of depth. For the structured case of the ordered taxonomy, the level at which the species of leaves are separated (level 2) shows a marked break in both Up and Uc.

The category uncertainty Uc also markedly changes its behavior at the second level. As expected, at level 0, where all objects are in one category, there is no category uncertainty. Splitting the leaves into two categories which do not share leaves of the same type produces only a marginal increase, because the categories are quite distinct and partially described leaves are still easily categorized. Splitting into the four leaf types similarly adds little category uncertainty. However, the next split causes an abrupt increase in category uncertainty. This is caused by the fact that now there are two categories containing leaves of each type. Therefore a partially described leaf will often match leaves in more than one category, yielding a high value of category uncertainty. As the categorization gets finer, Uc continues to increase.

It is important to notice that the graphs of Figure 4.14 are similar to those of Figure 4.8. In that example we evaluated a taxonomy of purely modal classes and features, but with the addition of several noise features. This similarity indicates that features which are not purely modal (they do not perfectly discriminate between classes) but which do have some diagnostic power may be viewed as the combination of modal features with noise features. To further illustrate this point, we can add more noise to the leaf example by including pure noise features. Figure 4.15 displays the results of using the same ordered taxonomy of Figure 4.12 but with the addition of 4 noise features; for each leaf, each of the noise features was randomly assigned one of 4 values. There is still a definite break in both the Up and the Uc curves, but they are becoming more like those of the jumbled evaluation of Figure 4.13. This graceful degradation with the addition of noise is essential if the category evaluation function is to be included in a robust method for recovering natural categories.

Figure 4.15: The evaluation curves for the ordered taxonomy, but with the addition of four noise features. Both Up and the normalized Uc show a definite break at level 2, the level corresponding to the separate species, but the curves are becoming more like those of the jumbled evaluation.

To summarize, we have empirically shown that the evaluation function is indeed sensitive to the structure of natural classes, in this case different leaf species. This sensitivity is indicated by the marked change in the behavior of the quantities Up and Uc at the depth of the taxonomy which corresponds to the “correct” categories. Also, the components of the evaluation function behave predictably in the absence of natural categories; this last point is crucial, since if we are to use this evaluation function to recover natural categorizations we must be able to distinguish between a minimum caused by structure and a minimum which occurs at some arbitrary level of categorization. To explore this question further, let us investigate how the parameter λ affects the evaluation of the taxonomies.

4.6.4 λ-space behavior

Let us return to our task of selecting the best level of a taxonomy for a given λ. Figure 4.16 shows the graphs of the total uncertainty U = (1 - λ) Up + λη Uc for the jumbled taxonomy of Figure 4.11 with λ equal to .2, .4, .6, and .8. Selecting the best level for a given λ is simply a matter of finding the depth which has the lowest value of U. In the case of the jumbled taxonomy, only the two extremes of depth are ever the minimum, with the trade-off occurring at a λ of about .5. From the graphs for Up and Uc of Figure 4.16 we can construct the λ-space diagram of Figure 4.17. The complete lack of structure in the taxonomy is reflected in this degenerate λ-space diagram; we have empirically demonstrated the predicted noise behavior of section 4.5.2.

Next let us consider how the total uncertainty U varies with λ for the ordered taxonomy. Graphs of U as a function of taxonomy depth for four different values of λ are shown in Figure 4.18. Notice that for all four values (.2, .4, .6, .8) the second level has the lowest total uncertainty; the second level corresponds to the categorization which contains four categories, each containing all the leaves of one species. Although we know that by design a λ of 0 will select the finest depth (8), and that a λ of 1.0 will select the coarsest depth (0), for these data a λ in the interval of approximately 0.1-0.9 will select the categorization containing four categories. In Figure 4.19 we construct the λ-space diagram for these data. The existence of the large stable region is an indicator that the categorization selected in that region contains categories that are highly structured in terms of the way they minimize the uncertainties of the inferences about an object’s properties and its category. It should be noted that the categories selected are those that correspond to the classes of leaves as defined by the botanists.
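The level-selection procedure described in this section can be sketched in a few lines. The sketch below is only illustrative: the Up and Uc curves are invented numbers chosen to mimic the qualitative shapes described for the jumbled and ordered taxonomies (they are not the measured values behind Figures 4.13 to 4.19), and the normalization coefficient η is treated as a single constant, which is a simplifying assumption.

```python
def best_depth(up, uc, eta, lam):
    """Return the depth minimizing U = (1 - lam) * Up + lam * eta * Uc."""
    totals = [(1 - lam) * p + lam * eta * c for p, c in zip(up, uc)]
    return min(range(len(totals)), key=totals.__getitem__)


def lambda_space(up, uc, eta, steps=100):
    """Pair each sampled lambda in [0, 1] with the depth it selects."""
    return [(i / steps, best_depth(up, uc, eta, i / steps))
            for i in range(steps + 1)]


# Hypothetical curves for depths 0..8 (invented for illustration only).
# Jumbled taxonomy: both components change smoothly with depth.
UP_JUMBLED = [8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0, 0.0]
UC_JUMBLED = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

# Ordered taxonomy: Up flattens after depth 2, Uc jumps after depth 2.
UP_ORDERED = [8.0, 5.0, 1.0, 0.95, 0.9, 0.85, 0.8, 0.5, 0.0]
UC_ORDERED = [0.0, 0.1, 0.2, 2.0, 3.5, 5.0, 6.0, 7.0, 8.0]

ETA = 1.0  # assumed constant normalization coefficient

for lam in (0.2, 0.4, 0.6, 0.8):
    print(lam,
          best_depth(UP_JUMBLED, UC_JUMBLED, ETA, lam),
          best_depth(UP_ORDERED, UC_ORDERED, ETA, lam))
```

With curves of this shape, the sweep over λ selects only the two extreme depths for the jumbled case, while the ordered case keeps selecting depth 2 over a wide interval of λ. That interval is what the λ-space diagram displays as a stable region.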
Figure 4.17: The λ-space diagram for the jumbled taxonomy. The degenerate condition of having only the extreme categorizations be stable reflects the lack of structure in the taxonomy.

Figure 4.19: The λ-space diagram for the ordered taxonomy. The categorization composed of four categories, each representing one type of leaf, is preferred over a wide range of λ, approximately 0.1 to 0.9. This stable region indicates that the four categories are structured in a manner consistent with natural classes.


Chapter  5 

Recovering  Natural  Categories 


We  have  defined  the  task  of  the  observer  to  be  that  of  discovering  categories 
of  objects  in  the  world  that  correspond  to  natural  modes;  these  modes  are  the 
natural  clusters  of  objects  formed  by  the  interaction  between  the  processes 
that  produce  objects  and  the  environmental  constraints  that  act  upon  them. 
Because  objects  of  the  same  natural  class  behave  similarly,  establishing  a 
natural  categorization  —  a  natural  set  of  categories  —  permits  the  observer 
to  make  inferences  about  the  properties  of  an  object  once  the  category  of 
that  object  is  determined.  The  question  we  address  in  this  chapter  is  what 
are  the  necessary  capabilities  that  must  be  provided  to  an  observer  if  he  is 
to  accomplish  this  task? 

We  can  divide  the  object  categorization  task  into  two  components.  First, 
if  the  observer  is  to  ever  succeed  in  generating  a  natural  categorization,  then 
he  must  be  able  to  determine  when  a  categorization  reflects  the  structure 
of  natural  modes.  Given  a  set  of  alternative  categorizations,  the  observer 
must  be  able  to  select  the  most  likely.  Thus,  he  must  be  provided  with  a 
categorization  evaluation  function.  Second,  the  observer  needs  a  method  of 
producing  categorization  proposals.  As  objects  are  viewed,  the  observer  must 
be  continually  refining  his  current  categorization,  attempting  to  recover  the 
natural  categories  present.  The  categorization  generation  method  must  be 
constructed  such  that  the  observer  will  eventually  propose  a  categorization 
corresponding  to  the  natural  modes. 

We begin this chapter by developing a categorization paradigm that makes these two components explicit and that agrees with one’s intuition about the categorization process; our development of the paradigm is inspired by work in formal learning theory [Osherson, Stob, and Weinstein, 1986]. Then we will present a categorization algorithm, based upon this paradigm, that has been implemented and tested; the operation and performance of the algorithm are demonstrated by examples drawn from three domains. Analysis of the competence of the algorithm provides insight into the effectiveness of the categorization procedure as well as the types of errors that may be expected. In particular, for certain ideal cases, the algorithm is shown to be guaranteed to converge to the correct categories. Finally, possible modifications of the algorithm to improve its behavior are discussed.

5.1  A  Categorization  Paradigm 

Consider  the  leaves  pictured  in  Figure  5.1.  To  most  observers  there  are  three 
groups  of  leaves  present:  ACH,  BFG,  DEJ.  In  fact,  botanists  would  state  that 
there  really  are  three  classes  of  objects  present,  and  that  an  observer  who 
identifies  those  three  classes  has  categorized  the  leaves  “correctly.”  Using 
these  leaves  as  an  example  let  us  develop  a  paradigm  for  categorizing  objects 
that  is  not  only  consistent  with  our  intuitions  about  categorization  but  also 
permits  us  to  precisely  define  the  object  categorization  problem.1  We  view 
the  categorization  task  as  a  learning  problem:  the  observer  attempts  to 
learn  natural  object  categories  as  he  inspects  the  world  of  objects.  Thus,  the 
categorization  paradigm  we  present  closely  resembles  the  generalized  learning 
paradigm  developed  by  Osherson,  Stob,  and  Weinstein  [1986],  based  upon 
the  language  acquisition  paradigm  originated  by  Gold  [1967].  Our  paradigm 
consists  of  four  components;  each  is  necessary  to  define  the  categorization 
task  precisely. 

The first requirement is that the goal of categorization be stated clearly. We define a categorization to be a partition of the objects in a population; the equivalence classes of the partition form the categories. Thus, for Figure 5.1, any possible grouping of the leaves constitutes a categorization, and the groups are the categories. However, if the observer is attempting to discover the “correct” categories, then we need a basis for determining a natural categorization. To provide such a basis we require the Principle of Natural Modes: environmental pressures interacting with natural object processes cause the world to be clustered in the space of properties important to the interaction between objects and the environment. We refer to these clusters as natural classes. In Figure 5.1 the natural classes correspond to the different species of leaves: White Oak (BFG), Sugar Maple (DEJ), Poplar (ACH). The goal of the observer is to recover these natural classes since they represent instances of objects which share common properties; they were generated by the same natural object process. Thus, natural classes serve to define a testable goal of the categorization procedure: produce categories of objects corresponding to natural classes.2 These natural categories are the first component of the categorization paradigm.

1 In Bobick and Richards [1986], a formal description of the categorization paradigm is provided. The terminology developed there permits a formal statement of the categorization problem. However, most of the important issues developed there can be discussed informally, by considering an example problem. The reader is referred to Bobick and Richards [1986] if further detail is required.

Three difficulties arise if we consider the recovery of “natural categories” as an objective goal for the categorization task. Two are philosophical. First, how can we independently judge if the observer is successful? To do so, we require independent identification of the natural classes, an omniscient observer or an oracle. Second, as demonstrated by Goodman [1951], Quine [1969], and Watanabe [1985], natural categories may only be said to exist if we restrict the properties of objects that are considered important. Otherwise, all objects are equally similar. (See section 2.3.1 for a review and proof of Watanabe’s Ugly Duckling Theorem, a theorem that explains this counter-intuitive claim.) How then can we say that one set of categories is more natural than another? To resolve these problems, we rely on the sciences that study the domains in question to provide an independent assessment of the natural classes. Because botanists have categorized the leaves in Figure 5.1 into three species, and because botanists study the processes that create leaves and the environments that constrain them, we will assume that the categories constructed by botanists represent “true natural classes.”

The third difficulty in using natural classes as a baseline against which to judge the competence of the observer is computational in nature. We have defined a categorization to be a partition of the objects in the world. But there is an uncountable infinity of possible objects.3 Thus, the number of partitions is also uncountably infinite. Even if there are only denumerably many objects (an assumption that would be valid if objects are produced by a countable number of “computational” construction procedures) there would still be an uncountable set of partitions. How can the observer ever hope to discover the correct set of categories if the space of potential categorizations is unsearchable?

2 We have purposely avoided using the phrase the natural classes because we do not wish to claim that there is a unique clustering of objects corresponding to a natural partitioning. As discussed in chapter 2, both the division between mammals and birds and that between cows and rabbits represent natural clusterings. Thus, two observers could both “correctly” learn the natural categories of the world and yet have different categorizations. We will return to this issue in the next chapter.

We can remedy this situation by placing a constraint on the categorization environment (the second component of the categorization paradigm) and by modifying the goal of the observer. When an observer views an object (such as one of the leaves in Figure 5.1), he cannot make use of the object itself as input to a categorization procedure. Rather, he must operate on some sensory description of the object. Thus, let us construct a categorization environment that consists of objects as described in some representation. As in chapter 4, we define a representation to be a mapping from the set of all possible objects onto some finite set O*.4 Each element of O* is referred to as an object description. Because the observer is no longer operating on the objects themselves, but on their descriptions as expressed in some representation, we alter the definition of an object categorization: a categorization Z is a partition of the set of representational descriptions corresponding to the objects in the world. Now, because there are only a finite number of object descriptions in the representation, the set of possible categorizations is not only countable, but also finite. Thus, one can construct computational procedures capable of searching the space of solutions to the categorization problem.

One may view a representation as a generalization or abstraction mechanism: an infinite number of objects are mapped onto a single point in the representation. Thus an important question arises as to whether a given representation is sufficient to permit correct categorization. Let us (informally) define a class preserving representation to be one in which disjoint natural classes of objects in the world map into disjoint sets in O*.5 If two objects of different natural classes map into the same point in the representation, then the representation is not class preserving and the observer will not be able to correctly categorize the objects. Therefore, for the observer to be successful in his task, the representation must be constrained to match the structure of classes. Once again we encounter the Ugly Duckling Theorem of Watanabe [1986] and require that the representation be chosen so that important properties, in this case those constrained by the natural classes, are made explicit. We refer to the description of classes of objects in terms of a representation as the projection of the classes onto that representation.
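The informal definition of a class preserving representation can be made concrete with a small check. The following is a minimal sketch under stated assumptions: the toy leaves, their attribute values, and the two candidate representations are all invented for illustration, and the function names are hypothetical rather than taken from the formal treatment in Bobick and Richards [1986].

```python
from collections import defaultdict


def is_class_preserving(objects, natural_class, represent):
    """True if no object description is shared by objects of different classes."""
    classes_seen = defaultdict(set)
    for obj in objects:
        classes_seen[represent(obj)].add(natural_class(obj))
    return all(len(classes) == 1 for classes in classes_seen.values())


def project_classes(objects, natural_class, represent):
    """Projection of the classes: the set of descriptions each class maps onto."""
    projection = defaultdict(set)
    for obj in objects:
        projection[natural_class(obj)].add(represent(obj))
    return dict(projection)


# Invented toy leaves: (species, margin, color, length in cm).
LEAVES = [
    ("oak", "lobed", "dark", 12.3),
    ("oak", "lobed", "light", 10.1),
    ("maple", "palmate", "dark", 9.0),
    ("poplar", "serrate", "light", 7.5),
]


def species(leaf):
    return leaf[0]


def by_margin_and_color(leaf):
    # Drops the continuous length, so many leaves share one description.
    return (leaf[1], leaf[2])


def by_color_only(leaf):
    # Too coarse: a dark oak and a dark maple collapse to the same point.
    return (leaf[2],)


print(is_class_preserving(LEAVES, species, by_margin_and_color))
print(is_class_preserving(LEAVES, species, by_color_only))
print(project_classes(LEAVES, species, by_margin_and_color))
```

The color-only representation fails the check because two different species share a description, which is exactly the situation in which the observer cannot categorize correctly.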

Having defined a class preserving representation we can now modify the component of the categorization paradigm corresponding to the natural categories. Instead of recovering the natural classes directly, the task of the observer becomes the recovery of the projection of the natural classes in some class preserving representation. The categorization proposed by the observer (the hypothesized partition of representation space) must be constructed such that if two objects in the world are mapped by the representation into the same category (equivalence class) of the partition then those two objects belong to the same natural class.

To complete the definition of the categorization environment, we must specify how the observer comes to experience the objects. In the example of the leaves in Figure 5.1, the observer may simply view all of the objects “simultaneously.” For a large or infinite world, a parallel observation of all objects is not possible. Thus we define an observation sequence to be an infinite sequence of objects, each described according to some representation; this sequence is viewed serially by the observer. We require that the sequence be infinite so that the observer always has data available as input to a category recovery procedure. However, there are only a finite number of distinct object descriptions. Therefore we will require that any object description that represents some object in the world must appear in the observation sequence an infinite number of times. This property of the observation sequence will be important when we discuss the error correcting capability of a categorization procedure. Note that our definition of observation sequence guarantees that there exists a point in the observation sequence at which the observer will have viewed all the object descriptions representing objects in the world.

5 Bobick and Richards [1986] provides a formal definition of a class preserving representation.
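One simple way to realize an observation sequence with the required properties is sketched below. It is only one possible construction, assumed here for illustration: the sequence is built from repeated shuffled passes over a finite set of descriptions, so every description recurs without bound and all of them have been seen once the first pass ends. The example descriptions are invented.

```python
import itertools
import random


def observation_sequence(descriptions, seed=0):
    """Yield object descriptions forever; each one recurs infinitely often."""
    pool = list(descriptions)
    if not pool:
        raise ValueError("need at least one object description")
    rng = random.Random(seed)
    while True:
        rng.shuffle(pool)   # each pass presents every description exactly once
        yield from pool


# Invented object descriptions for illustration.
descriptions = [("lobed", "dark"), ("lobed", "light"), ("serrate", "light")]

# After one full pass the observer has viewed every description.
first_pass = list(itertools.islice(observation_sequence(descriptions),
                                   len(descriptions)))
print(first_pass)
```

A sequence drawn by independent random sampling would satisfy the same requirement only with probability one; the pass-based construction makes the guarantee deterministic.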

After  encoding  some  information  about  the  objects  in  the  world,  the 
observer  must  propose  some  candidate  categorizations.  In  our  categorization 
paradigm,  we  require  that  the  observer  announce  a  categorization  hypothesis 
after  each  presentation  of  an  object  from  the  observation  sequence.  We 
can  decompose  the  task  of  announcing  hypotheses  into  two  components: 
hypothesis  generation  and  hypothesis  evaluation.  These  tasks  form  the  last 
two  components  of  the  categorization  paradigm. 

Hypothesis generation refers to the method used by the observer to propose candidate categorizations. One simple approach would be to check all possible partitions of the objects viewed so far: there are only a finite number of object descriptions and thus only a finite number of possible partitions. By our definition of the observation sequence, we know that the observer need only wait some finite amount of time before he will have viewed all the object descriptions present in the world. Assuming that the observer’s decision criteria (the evaluation procedure to be discussed presently) are capable of selecting the natural class categorization, an exhaustive enumeration is guaranteed to find the correct categories.

Unfortunately, the combinatorics of an exhaustive search make such a procedure impractical. For the 9 objects of Figure 5.1, there are over 20,000 possible partitions. If there are 15 objects, the number of partitions (categorizations) grows to 1.4 billion! Also, an exhaustive partitioning strategy is not well suited to the sequential presentation of objects provided by the categorization environment. When a new object is viewed, the previous hypothesis is irrelevant because an exhaustive search would again be executed and the best partition selected. In a world with thousands of objects, the discarding of previous hypotheses, and the work associated with producing them, is unacceptable.
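The partition counts quoted above are Bell numbers, and they can be checked directly. The sketch below uses the standard Bell triangle recurrence; the function name is our own.

```python
def bell_numbers(n_max):
    """Return [B(0), ..., B(n_max)], the number of partitions of an n-element set."""
    bells = [1]
    row = [1]
    for _ in range(n_max):
        # Each new row of the Bell triangle starts with the last entry of the
        # previous row; each further entry adds the entry above-left.
        next_row = [row[-1]]
        for value in row:
            next_row.append(next_row[-1] + value)
        row = next_row
        bells.append(row[0])
    return bells


bells = bell_numbers(15)
print(bells[9])    # partitions of the 9 leaves of Figure 5.1
print(bells[15])   # partitions of 15 objects
```

The ninth Bell number is 21,147 and the fifteenth is 1,382,958,545, in agreement with the figures of "over 20,000" and "1.4 billion" given in the text.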

An  alternative  to  the  exhaustive  search  is  a  dynamic,  data-driven  method 
of  hypothesis  generation.  This  is  the  approach  used  in  dynamic  classification 
(see,  for  example,  Duda  and  Hart  [1973]).  At  each  presentation  of  an  object, 
the  observer  considers  some  (usually  small)  set  of  candidate  hypotheses  based 
upon  the  previous  hypothesis  and  the  new  object.  Because  one  can  limit 
the  degree  to  which  any  new  object  may  alter  the  current  hypothesis,  the 
incremental  strategy  has  the  advantage  that  the  computational  complexity 
of  computing  the  new  hypotheses  can  be  restricted. 
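To illustrate how an incremental strategy bounds the work per observation, consider the following sketch. It is not the hypothesis generation method of this thesis, which is presented in the next section; it is a deliberately simple stand-in, assumed for illustration, in which a new object may only join one existing category or start a new one. The scoring function is likewise a hypothetical placeholder for a real evaluation function.

```python
def incremental_candidates(partition, new_object):
    """Candidate partitions after one new object: join a category or start one."""
    candidates = []
    for i in range(len(partition)):
        candidates.append([cat | {new_object} if j == i else cat
                           for j, cat in enumerate(partition)])
    candidates.append(list(partition) + [frozenset({new_object})])
    return candidates


def observe(partition, new_object, evaluate):
    """Announce the best candidate (lowest score) after viewing one object."""
    return min(incremental_candidates(partition, new_object), key=evaluate)


def toy_score(partition):
    # Hypothetical placeholder: within-category spread plus a cost per category.
    return sum(max(cat) - min(cat) for cat in partition) + 2 * len(partition)


current = [frozenset({1, 2}), frozenset({10, 11})]
print(len(incremental_candidates(current, 3)))
print(observe(current, 3, toy_score))
```

With k categories in the current hypothesis, only k + 1 candidates are examined for each new object, compared with the Bell-number growth of an exhaustive search. The price, as the text goes on to discuss, is that convergence to the correct hypothesis is no longer automatic.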


The use of an incremental approach raises some issues that are not relevant when employing an exhaustive strategy. In particular, one must consider whether the observer will ever converge to some particular hypothesis. Even though we know that there are only a finite number of partitions, it may be the case that the observer never converges to some particular hypothesis; for example, the observer may continually cycle through all the possible partitions. Also, because the observer is not considering all possible hypotheses, we must consider whether he will ever propose the “correct” one. In the next section we will describe an incremental hypothesis generation method that has been successfully demonstrated in several domains. It will be shown that in certain ideal cases, the method can be constructed such that it will converge with unit probability to the “correct” hypothesis; experimental results will demonstrate the method’s effectiveness on real data.

Finally, given a set of candidate categorizations, the observer needs to be able to select the one most likely to be correct: the one which is the most “natural.” To accomplish this task, the observer requires a hypothesis evaluation function. This function must be constructed such that categories corresponding to the natural classes are preferred over categories that ignore class structure. Like the representation, which is required to make explicit the properties of objects constrained by natural processes, the hypothesis evaluation function must be matched to the structure of the natural world.

Having defined the four components of categorization we may now state the categorization problem more precisely. We assume the following are given: a set of natural object classes, a class preserving representation in which objects are described, an observation sequence in which all the object descriptions are presented, a hypothesis generation method to produce candidate categorizations, and a hypothesis evaluation function which provides criteria as to which categorization should be chosen. We say that the observer has successfully categorized the world of objects on some observation sequence if and only if 1) he announces some categorization hypothesis after every presentation of an object description, and 2) he eventually converges to a hypothesis which is the projection of the natural classes in the class preserving representation. By “converge” we mean that the observer eventually announces the correct hypothesis and that he never deviates from that hypothesis as he continues to view the observation sequence.

Notice that any particular categorization of objects is learnable. A strongly nativist theory of object categorization would claim that the observer always announces a hypothesis corresponding to one particular categorization $Z_0$, where $Z_0$ has been selected by evolution to appropriately categorize the world. That is, the observer would completely ignore the data of the observation sequence. However, it is unreasonable to expect evolution to provide for an object category such as “refrigerator.” A more plausible theory of categorization, and that which has been proposed here, is that evolution equips the observer with the necessary tools (representation, hypothesis generation method, hypothesis evaluation function) for the recovery of natural object categories.


5.2  Categorization  Algorithm 

In  this  section,  we  will  present  a  categorization  system  that  reflects  the 
paradigm  developed  above;  this  system  has  been  implemented  and  tested. 
Because  the  representation  (the  most  important  aspect  of  the  categorization 
environment)  and  the  hypothesis  evaluation  function  are  described  in  detail 
in  chapter  4,  we  will  only  provide  a  brief  description  of  these  components  of 
the  categorization  system.  The  hypothesis  generation  method,  however,  will 
be  presented  in  detail.  We  will  evaluate  the  performance  of  the  algorithm  by 
examining  the  results  of  tests  conducted  in  three  domains.  In  the  following 
sections,  we  will  consider  the  competence  of  the  categorization  algorithm, 
the  types  of  errors  likely  to  arise,  and  possible  remedies. 
5.2.1 Categorization environment

The representation (the first component of the categorization environment) used by the categorization system consists of property vectors. Our terminology is defined as follows: feature refers to a function or predicate computed about an object; value, to the value taken by a feature; property, to the valued feature. For example, “length” is a feature, “6 ft.” is a value, and “having length 6 ft.” is a property. Each feature $f_i$, $1 \le i \le m$, has an associated set of values $\{v_{i1}, v_{i2}, \ldots, v_{in_i}\}$ referred to as the range of the feature. We require that the range be a finite set, but the cardinality of the range can vary from one feature to the next. $F$ denotes the set of features $\{f_1, f_2, \ldots, f_m\}$. Using these features, each object is represented by an m-dimensional property vector $P = (v_{1\alpha}, v_{2\beta}, \ldots, v_{m\gamma})$, where $v_{ij}$ is the jth value of the range of the ith feature.

To  complete  our  specification  of  the  categorization  environment  requires 
the  generation  of  an  observation  sequence.  Normally,  the  world  itself  pro¬ 
vides  a  set  of  objects  that  can  be  sampled  to  form  the  sequence.  However, 
to  test  the  categorization  algorithm  we  need  to  generate  property  vectors 
corresponding  to  objects.  Furthermore,  to  evaluate  the  performance  of  the 
categorization  system  these  property  vectors  must  be  constructed  such  that  a 
natural  categorization  exists.  To  satisfy  these  criteria,  property  specifications 
—  a  listing  of  acceptable  property  values  —  are  provided  for  several  classes 
of  objects.  An  object  generator  then  creates  property  vectors  consistent 
with  these  specifications.  We  will  describe  the  specific  features  and  values 
of  the  property  specifications  when  we  present  the  examples  illustrating  the 
performance  of  the  categorization  procedure  in  different  domains. 
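
A minimal sketch of such an object generator follows. The two specifications reuse the Length, Width, and Lobes entries listed for maple and poplar in Table 5.1; the uniform, independent sampling of each feature and all function names are assumptions, since the text does not say how the original generator drew its values.

```python
import random

# Property specifications: for each class, the acceptable values of each feature.
SPECS = {
    "maple": {"length": [3, 4, 5, 6], "width": [3, 4, 5], "lobes": [5]},
    "poplar": {"length": [1, 2, 3], "width": [1, 2], "lobes": [1]},
}
FEATURES = ("length", "width", "lobes")


def generate_object(spec, rng):
    """Draw one property vector consistent with a specification (uniform choice is an assumption)."""
    return tuple(rng.choice(spec[f]) for f in FEATURES)


def generate_population(specs, per_class, seed=0):
    """Return (name, property_vector) pairs; the name records the true class for the reader only."""
    rng = random.Random(seed)
    population = []
    for cls, spec in specs.items():
        for i in range(per_class):
            population.append((f"{cls}-{i}", generate_object(spec, rng)))
    return population


def satisfies(vector, spec):
    """True if every value of the vector is acceptable under the specification."""
    return all(value in spec[f] for f, value in zip(FEATURES, vector))


population = generate_population(SPECS, per_class=25)
print(population[:3])
```

Because the two specifications share no value of "lobes", no generated vector satisfies both, so a natural categorization exists by construction.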

5.2.2 Categorization uncertainty as an evaluation function

The  hypothesis  evaluation  function  provides  the  criteria  by  which  proposed 
categorizations  are  selected.  Because  the  categorization  task  requires  recov¬ 
ering  the  natural  categories,  the  evaluation  function  must  reflect  the  natural 
structure  found  in  the  world. 

The evaluation function is based upon the categorization uncertainty measure
U. It is defined by:

$U(Z) = (1 - \lambda)\, U_P(Z) + \lambda\, \eta(Z)\, U_C(Z)$    (5.1)

where $U_P$ is the uncertainty about the properties of an object once its category
is known, $U_C$ is the average uncertainty of the category to which an
object belongs given a subset of the properties describing the object, $\eta$ is a
normalization coefficient between $U_P$ and $U_C$, and $\lambda$ is a free parameter
representing the desired trade-off between the two uncertainties. (See chapter 4
for complete definitions and derivations of these terms.) For the remainder
of this chapter we will assume that $\lambda$ is set to some particular value which
satisfies the goals of the observer. In chapter 6 we will consider the effect of
$\lambda$ on the categorization procedure and its interaction with the natural object
categories.
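
The linear combination in equation 5.1 can be sketched as follows. The two component uncertainties and the normalization coefficient are defined in chapter 4 and are simply passed in here as numbers. The names u_p, u_c, eta, and lam, the restriction of the trade-off parameter to [0, 1], and the example values are all assumptions made for illustration.

```python
def total_uncertainty(u_p, u_c, eta, lam):
    """Combine property uncertainty u_p and category uncertainty u_c as in equation 5.1.

    lam is the free trade-off parameter (assumed here to lie in [0, 1]);
    eta is the normalization coefficient between the two uncertainties.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam is assumed to lie in [0, 1]")
    return (1.0 - lam) * u_p + lam * eta * u_c


# Made-up numbers: a coarse categorization (uncertain properties, certain category)
# against a fine one (more certain properties, less certain category).
coarse = {"u_p": 2.0, "u_c": 0.0}
fine = {"u_p": 0.5, "u_c": 0.5}
for lam in (0.6, 0.9):
    scores = {
        "coarse": total_uncertainty(coarse["u_p"], coarse["u_c"], 1.0, lam),
        "fine": total_uncertainty(fine["u_p"], fine["u_c"], 1.0, lam),
    }
    print(lam, min(scores, key=scores.get), scores)
```

With these invented values the preferred categorization changes as the trade-off parameter grows, which is the kind of dependence on the parameter that chapter 6 examines.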

We use the total uncertainty of a categorization U as an evaluation function
because this measure reflects the degree to which a categorization permits
the observer to accomplish the recognition goal of making reliable inferences
about the properties of objects. Thus a categorization which minimizes
U  is  guaranteed  to  be  useful  to  the  observer:  the  evaluation  function  directly 
measures  the  utility  of  a  categorization.  This  desirable  property  of  the  evalu¬ 
ation  function  is  absent  in  the  standard  distance  metrics  employed  by  cluster 
analysis  techniques.  Furthermore,  the  Principle  of  Natural  Modes  supports 
the  claim  that  if  a  categorization  supports  the  goals  of  the  observer  then 
that  categorization  reflects  a  structuring  of  the  objects  consistent  with  the 
natural  modes.  As  such,  this  function  is  well  suited  for  the  evaluation  of 
proposed  categorizations. 

5.2.3  Hypothesis  generation 

The  hypothesis  generation  method  we  present  has  been  designed  to  be  con¬ 
sistent  with  the  categorization  paradigm.  First,  the  algorithm  is  guaranteed 
to  produce  a  hypothesis  at  each  point  along  the  observation  sequence.  Thus, 
the  observer  will  never  halt  and  refuse  to  announce  a  categorization.  Sec¬ 
ond,  categories  are  permitted  to  continually  split  and  merge  making  every 
possible  categorization  fall  within  the  scope  of  the  algorithm.  Finally,  the 
algorithm  takes  advantage  of  the  infinite  observation  sequence  by  correcting 
“mistakes”  only  when  viewing  an  object  previously  placed  in  an  incorrect 
category.  Because  the  observer  is  guaranteed  to  view  each  object  repeatedly, 
this  form  of  data  driven  error  correction  is  appropriate. 

The  method  may  be  described  as  a  hybrid  of  divisive  and  agglomerative 
clustering  techniques  [Duda  and  Hart,  1973;  Hand,  1981].  (See  chapter  3 
for  a  discussion  of  these  methods.)  The  basic  steps  of  the  algorithm  are  as 
follows: 

1.  Construct  an  initial  categorization  consisting  of  a  single  category  by 
randomly  selecting  a  small  number  of  objects  from  the  population. 

2. View a new object from the observation sequence. (The term “object”
refers to an object description in the property space representation.)

3. Given a current categorization hypothesis, select the category to which
adding the new object results in the best new categorization (“best”
in terms of lowest total uncertainty U). Add the new item to the
“selected” category.

4.  Test  if  merging  the  selected  category  with  any  other  category  yields  a 
better  categorization.  If  so,  merge  the  selected  category  with  the  best 
of  those,  and  make  the  resulting  category  the  new  selected  category. 

5.  Delete  any  objects  identical  to  the  new  object  which  were  previously 
categorized  into  a  category  different  than  that  which  was  selected  dur¬ 
ing  this  iteration. 

6.  If  there  are  “enough”  objects  in  the  selected  category,  attempt  to  split 
the  category  into  two  new  categories  such  that  a  better  categorization 
is  achieved. 

7.  Go  to  Step  2. 

We  postpone  examining  the  competence  of  the  algorithm  until  we  present 
examples  of  its  operation. 
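
The steps above can be sketched as follows. This is a reading of the listed steps under stated assumptions, not the original program: the evaluation function is passed in as a parameter, and the toy_score used in the usage example is a stand-in, not the measure U. The initial category is taken from the first few objects of the sequence (assumed to be randomly ordered), the best of the randomly sampled splits is accepted, and only emptied categories are removed after deletion. All names and default values are hypothetical.

```python
import math
import random
from collections import Counter


def union(a, b):
    """Join two categories, keeping order and dropping duplicate objects."""
    return list(dict.fromkeys(a + b))


def toy_score(cats, penalty=0.3):
    """Stand-in evaluation (NOT the measure U): size-weighted feature entropy
    of each category plus a penalty that grows with the number of categories."""
    total = sum(len(c) for c in cats)
    score = penalty * math.log2(len(cats))
    for c in cats:
        for f in range(len(c[0][1])):
            counts = Counter(obj[1][f] for obj in c)
            entropy = -sum(n / len(c) * math.log2(n / len(c)) for n in counts.values())
            score += len(c) / total * entropy
    return score


def categorize(sequence, evaluate, initial_size=3, min_size=4, tries_per_object=2, seed=0):
    """Sketch of Steps 1-7; evaluate(categories) returns a score, lower is better."""
    rng = random.Random(seed)
    stream = iter(sequence)
    # Step 1: one initial category built from the first few objects.
    cats = [union([], [next(stream) for _ in range(initial_size)])]
    for obj in stream:  # Step 2 (the loop itself plays the role of Step 7)
        # Step 3: add obj to the category that gives the lowest score.
        def added_to(i):
            return [union(c, [obj]) if k == i else c for k, c in enumerate(cats)]
        sel = min(range(len(cats)), key=lambda i: evaluate(added_to(i)))
        cats = added_to(sel)
        # Step 4: merge the selected category with the best other one, if that helps.
        best, best_score = None, evaluate(cats)
        for j in range(len(cats)):
            if j == sel:
                continue
            rest = [c for k, c in enumerate(cats) if k not in (sel, j)]
            candidate = rest + [union(cats[sel], cats[j])]
            score = evaluate(candidate)
            if score < best_score:
                best, best_score = candidate, score
        if best is not None:
            cats, sel = best, len(best) - 1
        # Step 5: delete earlier copies of obj held by non-selected categories.
        selected = cats[sel]
        others = [[o for o in c if o != obj] for k, c in enumerate(cats) if k != sel]
        others = [c for c in others if c]  # drop categories emptied by the deletion
        cats = others + [selected]
        # Step 6: if large enough, try random two-way splits of the selected category.
        if len(selected) >= 2 * min_size:
            best, best_score = None, evaluate(cats)
            for _ in range(tries_per_object * len(selected)):
                shuffled = rng.sample(selected, len(selected))
                cut = rng.randint(min_size, len(selected) - min_size)
                candidate = others + [shuffled[:cut], shuffled[cut:]]
                score = evaluate(candidate)
                if score < best_score:
                    best, best_score = candidate, score
            if best is not None:
                cats = best
    return cats


# Usage: two hypothetical classes over three binary features, each object viewed five times.
rng = random.Random(1)
population = [(f"a{i}", (0, 0, 0)) for i in range(6)] + [(f"b{i}", (1, 1, 1)) for i in range(6)]
sequence = [obj for _ in range(5) for obj in rng.sample(population, len(population))]
categories = categorize(sequence, toy_score)
print([sorted(name for name, _ in c) for c in categories])
```

Whatever evaluation function is supplied, the sketch always returns a categorization in which every object seen so far belongs to exactly one category, mirroring the requirement that a hypothesis be available at every point of the observation sequence.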

5.2.4  Example  1:  Leaves 

The  first  domain  in  which  we  illustrate  the  performance  of  the  categorization 
algorithm is that of leaves, like those in Figure 5.1. For several species of
leaves, property specifications were generated according to descriptions provided
by Preston [1976] (see Table 5.1). The properties chosen are known to be
diagnostic of leaf species and thus are sufficient for the categorization task.
Note that for these classes of leaves the representation is class preserving:
no property vector can be constructed that satisfies more than one species
specification. For this example, the free parameter $\lambda$ of the evaluation function
U has been set to a value of 0.6.

Let  us  trace  the  categorization  process  by  examining  the  dynamic  output 
of  the  program  shown  in  figures  5.2  and  5.3.  As  each  new  object  is  viewed, 
a  new  row  is  added,  showing  the  categorization  proposed  by  the  system  in 
response  to  that  new  object;  the  new  object  is  shown  on  the  left.  In  these 
examples,  the  objects'  names  are  used  to  indicate  (to  the  programmer)  the
true  classes  to  which  the  leaves  belong,  e.g.  COTTON-145  is  a  cottonwood 
leaf. The program, of course, uses only the property vectors of the objects as
input. The circled numbers to the left indicate the significant events that we will
discuss. In this example, the population consists of 150 leaves, 25 examples
of each of 6 species.

Species     Length       Width        Flare   Lobes  Margin              Apex       Base      Color
Maple       {3,4,5,6}    {3,4,5}      0       5      Entire              Acute      Truncate  Light
Poplar      {1,2,3}      {1,2}        {0,1}   1      {Crenate, Serrate}  Acute      Rounded   Yellow
Oak         {5,6,7,8,9}  {2,3,4,5}    0       {7,9}  Entire              Rounded    Cuneate   Light
Birch       {2,3,4,5}    {1,2,3}      0       19     Doubly-Serrate      Acute      Rounded   Dark
Cottonwood  [illegible]  [illegible]  2       1      Crenate             Acuminate  Truncate  {Light, Dark, Yellow}
Elm         {4,5,6}      {2,3}        {0,-1}  1      Doubly-Serrate      Acuminate  Rounded   Dark

Table 5.1: Leaf specifications for several species of leaves. A leaf generator was
designed which created property vectors consistent with the different specifications.

Event  1  is  the  start  of  the  categorization  algorithm.  Because  step  3 
of  the  algorithm  requires  a  current  categorization,  we  begin  with  an  initial 
categorization  consisting  of  a  small  random  collection  of  objects  forming 
one  category.  In  the  next  section,  when  we  analyze  the  performance  and 
competence  of  the  hypothesis  generation  method,  we  will  place  bounds  on 
how  large  this  initial  category  may  be. 

Event 2 represents viewing a new object, in this case the leaf COTTON-145.
Step 3 of the algorithm selects the category to which adding the new
leaf produces the best categorization. As there is only one category in the
current categorization, COTTON-145 is added to that category. Because
there are as yet no other categories, the merging step (4) and the deletion
step (5) are skipped. Next, the splitting step (6) is executed. It is important
to understand the details of this step because the splitting procedure is the
only means by which a new category can be created and thus is most critical
in determining the competence of the system.

Because the evaluation function U is statistical in nature, based upon
probabilities and information theory, it does not yield reliable results when
applied to categories that are too small. We restrict its application by requiring
that any category formed by splitting contain some minimum number of
objects; for this particular example, a category was required to contain at
least 4 objects. Therefore the algorithm does not attempt to split a category
unless it contains at least twice the minimum number of objects (8 in this
example).

Figure 5.2: The output of the categorization program. Each row is the proposed
categorization after having viewed the object listed at the left. The output
continues to the next page.

Figure 5.3: Continuation of the output of the categorization program.

Assuming  a  category  is  large  enough  to  be  split,  as  is  the  case  at  event 
2,  candidate  partitions  must  be  created.  Because  the  number  of  partitions 
of  a  category  is  huge  (even  when  partitioning  a  set  into  only  two  subsets), 
not  all  partitions  of  a  category  into  two  new  categories  may  be  attempted.
Therefore,  only  some  (randomly  chosen)  divisions  are  tried.  Thus,  if  there 
exists  a  split  of  a  category  that  yields  a  better  categorization  than  the  cur¬ 
rent  hypothesis,  the  probability  of  discovering  that  partition  is  proportional 
to  the  number  of  partitions  attempted.  In  the  current  implementation  the 
number  of  partitions  considered  is  proportional  to  the  size  of  the  category; 
when  we  discuss  the  convergence  and  correctness  properties  of  this  algo¬ 
rithm,  this  sampling  rate  will  become  important.  At  event  2,  none  of  the 
partitions  considered  yielded  a  categorization  with  a  lower  uncertainty  than 
the  categorization  consisting  of  only  one  category. 
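
To give a sense of the numbers involved, the following sketch counts the two-way partitions of a category and compares that count with a sampling budget proportional to the category size. The minimum category size of 4 comes from the example above; the rate of two tries per object and the function names are hypothetical.

```python
from math import comb


def two_way_partitions(size, min_size=1):
    """Count unordered splits of `size` objects into two parts of at least min_size objects each."""
    ordered = sum(comb(size, s) for s in range(min_size, size - min_size + 1))
    return ordered // 2  # each unordered split is counted twice in the sum


def sampled_fraction(size, min_size=4, tries_per_object=2):
    """Upper bound on the fraction of admissible splits examined in one iteration."""
    total = two_way_partitions(size, min_size)
    if total == 0:
        return 0.0
    return min(1.0, tries_per_object * size / total)


for size in (8, 16, 32):
    print(size, two_way_partitions(size, 4), sampled_fraction(size))
```

The count grows exponentially with the category size while the budget grows only linearly, so the examined fraction shrinks quickly as the category grows.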

At  event  3,  however,  a  partition  is  accepted.  As  before,  the  new  leaf 
(COTTON-148)  is  added  to  the  only  category  in  the  current  categorization. 
However,  in  this  case  a  split  of  that  category  was  discovered  which  yielded 
a  better  categorization  than  the  single  category.  Event  4  is  another  instance 
of  successful  splitting. 

One  of  the  dangers  of  an  algorithm  such  as  this  is  that  it  is  possible  to 
cause two categories to be created which should be one. Event 5 is an example
of such an occurrence. In this case, the addition of the leaf MAPLE-4
caused  a  category  to  split,  separating  maples  from  oaks.  But  a  previous  split 
had  already  created  a  category  containing  oaks.  If  the  algorithm  is  to  suc¬ 
cessfully  categorize  these  objects,  then  these  two  categories  must  eventually 
be  merged.  That  is  the  purpose  of  step  4  in  the  algorithm.  At  event  6, 
the  leaf  OAK-40  was  initially  added  to  the  category  with  the  6  oak  leaves. 
This  category  was  then  merged  with  the  category  containing  the  two  other 
oak  leaves.  Even  though  this  second  category  contained  two  leaves  that  are 
not  oak,  the  merging  of  the  two  categories  yielded  a  better  categorization. 
Merging  assures  that  splinter  categories  that  are  created  because  of  the  order 
of  presentation  of  the  objects  may  later  be  reclaimed. 

One  more  form  of  error  correction  is  necessary.  Although  merging  can 
combine  categories  mistakenly  separated,  it  cannot  remove  isolated  errors 
that result from previous mistakes. To correct this type of error, we add
the deletion step of the algorithm (5). Examples of this step are shown at
events 7 and 8. At event 7, the leaf COTTON-138 was viewed for the second
time, and placed in the category containing only other cottonwood leaves.
Notice that there were two cottonwood leaves present in the oak category,
one of which was a previous instance of COTTON-138. Because the current
instance was placed in a different category, the program can correct an earlier
“mistake” by deleting the previous instance. Event 8 is a similar event
where MAPLE-17 was corrected.7

The  last  categorization  shown  in  Figure  5.3  represents  the  steady  state 
categorization  produced  by  the  algorithm;  at  this  point  the  program  was 
interrupted.  Notice  that  the  categorization  procedure  recovered  the  natural 
classes,  except  for  the  one  category  consisting  of  poplar  and  birch  leaves. 
Thus,  the  algorithm  converged,  although  not  quite  correctly. 

The  first  observation  to  be  made  is  that  the  algorithm  performs  quite 
well.  Most  tests  performed  on  the  leaves  domain  yielded  results  as  good 
as  those  shown,  or  better,  where  the  solution  was  exactly  the  (botanically) 
correct  categorization.  The  fact  that  an  evaluation  function  based  upon  the 
goals  of  an  observer  and  an  incremental  hypothesis  generation  method  could 
produce  a  natural  and  correct  categorization  provides  empirical  support  for 
the  categorization  principles  embodied  in  the  procedure. 

However,  as  shown,  the  algorithm  does  make  errors,  even  in  a  domain 
where  it  sometimes  generates  the  correct  solution.  After  presenting  another 
example  domain,  we  will  discuss  the  competence  and  behavior  of  the  algo¬ 
rithm,  the  predictable  errors,  and  possible  remedies. 

5.2.5  Example  2:  Bacteria

To  further  illustrate  the  behavior  of  the  categorization  algorithm,  we  test  the 
procedure  on  a  domain  comprised  of  infectious  bacteria.  For  these  examples, 
property  specifications  for  six  different  species  of  bacteria  were  encoded. 
Table  5.2  displays  the  specifications  for  these  species;  the  data  are  taken 
from [Dowell and Allen, 1981]. Because most of the “real” features take on
only one value per species (unlike the leaves where features like “length”
and “width” varied greatly) two noise features were added (nf1 and nf2).
These features prevented all objects of the same class from having identical
property descriptions.

7 A note about the implementation: Because the categorization evaluation function requires
that the categories be sufficiently large, categories that grow too small because of this
deletion step are themselves deleted. The infinite observation sequence guarantees that
these objects will be viewed again.

Table 5.2: The property specifications for six species of bacteria. Because most of
the real features have only one value (unlike the leaves where features like “length”
and “width” varied greatly) two noise features are added (nf1 and nf2).

Of the six species, three are from the genus bacteroides; these are abbreviated
as BF, BT, and BV. The other three (FM, FN, and FV) are from
the genus fusobacterium. Notice that several of the features of the specifications
are determined by the genus, while others are determined by the
species. For example, all members of bacteroides have the property “gr-kan
= R” (coding for “growth in presence of Kanamycin is resistant”). Other
properties, such as “dole,” vary between the species, ignoring genus boundaries.
These data were chosen as an example of a population in which there
is more than one natural clustering. In this chapter we are only concerned
with illustrating the operation of the categorization algorithm. In the next chapter
we consider the issue of multiple clusterings, and the interaction between
$\lambda$ and the categories recovered.

Figure 5.4 displays the results of executing the categorization procedure
with $\lambda$ set to 0.65. Notice that the bacteria have been categorized according
to their genus. Is this the “correct” solution? As mentioned in chapters 2 and
4, the natural clustering of objects in the world occurs at many levels. Mammals
and birds represent one natural clustering; cows and horses, another.
That the categorization procedure recovered the different genera is another
demonstration of the ability of the algorithm to recover natural categories.
These two genera consist of two distinct types of bacteria: the bacteroides
are only found in the GI tract and the fusobacterium are located in the oral
cavity. Thus, the categorization recovered in Figure 5.4 is a correct solution.

Figure 5.4: Categorizing bacteria. In this example $\lambda$ equals 0.65. The categories
recovered correspond to the different genera.

5.2.6  Example  3:  Soybean  diseases 

The  last  domain  in  which  we  demonstrate  the  effectiveness  of  the  categoriza¬ 
tion  algorithm  is  that  of  soybean  plant  diseases.  These  data  are  of  interest 
because  they  have  been  used  by  previous  researchers  to  demonstrate  the 
competence  of  clustering  algorithms.  Michalski  and  Stepp  [1983b]  make  use 
of  these  data  to  demonstrate  the  effectiveness  of  their  conceptual  clustering 
technique  (see  discussion  in  chapter  3);  at  the  same  time  they  demonstrate 
that  several  standard  numerical  clustering  techniques  are  not  capable  of  re¬ 
covering  the  correct  categories.  Thus,  these  data  provide  a  means  by  which 
to  measure  the  performance  of  the  categorization  algorithm  relative  to  other 
clustering  procedures. 

Table  5.3  displays  the  property  specifications  for  each  of  four  different 
soybean  plant  diseases;8  these  data  are  derived  from  the  data  presented  in 
Stepp  [1983].  In  their  original  form,  the  data  were  listed  simply  as  property 
vectors  of  several  instances.  In  order  to  provide  a  population  large  enough 
for  the  application  of  the  categorization  algorithm,  a  property  specification 
for  each  species  was  derived  by  taking  the  union  of  the  values  of  the  features 
for  all  instances  of  that  species.  For  example,  the  “time”  feature  for  disease 
Rhizoctonia  Root  Rot  has  the  specified  values  of  {3, 4, 5, 6};  thus,  each  of  these
values  occurred  in  at  least  one  property  vector  for  an  instance  of  that  disease.9 
Notice  that  these  properties  contain  much  more  noise  and  are  less  modal 
than either of the two previous examples of leaves and bacteria. Successful
categorization of this domain requires that the algorithm be insensitive to
unconstrained features and robust in its category evaluation.

8 The letters A, B, C, and D of the top line are used for display in the program output.

9 Another modification was the deletion of constant features, that is, features that took the same
value for all instances. In chapter 6 we show that constant features have no effect on the
categorization uncertainty measure U and thus can be removed from consideration.
Removing extra features reduces the number of feature subsets and makes the categorization
algorithm more efficient.
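
The derivation of a property specification from a list of instances, together with the removal of constant features, can be sketched as follows. Only the “time” values {3, 4, 5, 6} come from the text; the second class, the feature “fixed”, and the function names are invented for illustration.

```python
def derive_specification(instances):
    """Union, per feature, of the values observed over the instances of one class."""
    spec = {}
    for instance in instances:
        for feature, value in instance.items():
            spec.setdefault(feature, set()).add(value)
    return spec


def drop_constant_features(specs):
    """Remove features that take the same single value across every class."""
    features = {f for spec in specs.values() for f in spec}
    constant = {
        f for f in features
        if len(set().union(*(spec.get(f, set()) for spec in specs.values()))) == 1
    }
    return {
        cls: {f: values for f, values in spec.items() if f not in constant}
        for cls, spec in specs.items()
    }


# Hypothetical instances of two classes; "fixed" never varies and is therefore dropped.
instances = {
    "A": [{"time": t, "fixed": 0} for t in (3, 4, 5, 6)],
    "B": [{"time": t, "fixed": 0} for t in (1, 2)],
}
specs = {cls: derive_specification(rows) for cls, rows in instances.items()}
reduced = drop_constant_features(specs)
print(reduced)
```

Taking the union of observed values yields the least restrictive specification consistent with the listed instances, which is one reason such specifications are noisier than hand-written ones.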

Figure 5.5 displays the results of executing the categorization algorithm
with $\lambda$ set to 0.5. Notice that the correct categories, those corresponding to
the species, have been recovered. We should emphasize that the algorithm is
not told how many categories are present, unlike that of Michalski and Stepp
[1983b]. Rather, the algorithm discovers the appropriate number of classes in
its search for natural categories.10 The fact that the categorization procedure
is capable of recovering the correct categories in this complex domain, a
domain in which other clustering techniques have failed, validates the
algorithm as a useful categorization technique.

5.3  Categorization  competence 

We  have  demonstrated  the  effectiveness  of  the  categorization  algorithm  in 
several  domains.  However,  as  shown  in  the  leaves  example,  the  algorithm 
does  not  always  converge  to  the  correct  solution,  even  in  a  domain  where 
it  sometimes  does  produce  the  correct  categorization.  To  understand  the 
behavior  of  the  categorization  procedure  we  need  to  analyze  the  competence 
of  the  algorithm.  The  case  we  consider  is  when  there  are  only  two  classes  of 
objects  in  a  population.  The  study  of  this  problem  will  also  provide  insight 
into  the  behavior  of  the  algorithm  when  there  are  more  classes  present.  We 
assume  that  the  representation  is  class  preserving,  making  the  categorization 
task  possible.  The  issue  is  whether  the  algorithm  will  recover  two  categories 
corresponding  to  the  two  classes. 

Because we start with a categorization consisting of one category, the
answer to the above question depends upon whether the algorithm will ever
split a category containing both classes into two categories approximating the
natural classes. By “approximating” we mean that the two new categories
formed are such that any new objects seen will be categorized according to
their true classes. Any mistakes caused by an inexact split would be corrected
when the objects are viewed again at a later time; because the observation
sequence contains an infinite number of repetitions of each object description
we know that every object will be encountered again. We refer to these
approximate categories as captivating categories: they capture all new objects
presented that are members of the appropriate class. Notice that whether a
category is captivating depends upon the hypothesis evaluation function (in
this case the uncertainty measure U) as well as the other categories present:
the evaluation function determines to which category an object is assigned.

10 To be complete, we should mention how the categorization of the soybean diseases varies
as we change $\lambda$; the value of $\lambda$ can affect the categories that are recovered. In fact,
unlike the leaves or the bacteria example, there does exist another categorization that
is reliably recovered by the categorization algorithm. When the value of $\lambda$ is .65, the
recovered categorization consists of the three categories A, B, and {C,D}. This situation
indicates that there are two natural levels of categorization in this domain. In chapter
6 we explore the issue of multiple modal levels, where more than one level of constraint
is operating in a population. Our primary example in that chapter will be the bacteria
where the genera and the species provide multiple levels of constraint. However, because
we do not have any objective evidence of multiple levels within the soybean domain, we
present the multiple categorizations of the soybean diseases in Appendix B.

Figure 5.5: Execution of the categorization algorithm in the domain of soybean
plant diseases described in Table 5.3. For this example, the value of $\lambda$ is .5. The
categorization algorithm successfully recovers the species.

For the moment, let us suppose that the only partition of the category
containing two classes that yields a better categorization than the original
is the partition that exactly separates the two classes. We wish to know
whether that one partition will be discovered. We assume that the original
category is formed by viewing an observation sequence which is unbiased in
its distribution of objects from the two classes. Thus, we assume that we
have a current categorization consisting of one category of size $2n$ and that
contained in the single category are $n$ objects of each of the two classes.11
We also assume that only partitions of two equal sized categories are considered
as possible split categorizations; there are $\binom{2n}{n}/2$ such partitions. If the
hypothesis generation method proposes only a fixed number of partitions at
each iteration, then the probability of finding the one categorization corresponding
to the correct categories rapidly decreases as the single category
size ($2n$) increases. Because $\binom{2n}{n}/2 \gg 2n$, the probability of finding the correct
split during the current iteration quickly vanishes to zero as new objects
are added to the single category. The exponential rate of increase in the
number of partitions guarantees this is true even if the number of partitions
considered increases polynomially with the size of the single category. (In
the current implementation the number of partitions considered increased
linearly with the size of the category.) Thus, having a high probability of
discovering the correct class-separating categorization requires doing so before
the single category becomes “too large.”
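
The argument can be made concrete with a short bound. As an assumption for illustration, suppose the splitting step draws $m = c \cdot 2n$ equal-sized partitions uniformly and independently from the $M$ available, where $c$ is a hypothetical constant number of tries per object:

```latex
\begin{align*}
M &= \tfrac{1}{2}\binom{2n}{n}, \qquad m = c \cdot 2n, \\
\Pr[\text{correct split found}] &= 1 - \Bigl(1 - \tfrac{1}{M}\Bigr)^{m}
   \;\le\; \frac{m}{M} \;=\; \frac{4cn}{\binom{2n}{n}}, \\
\binom{2n}{n} \ge \frac{4^{n}}{2n+1}
   \quad&\Longrightarrow\quad
   \Pr[\text{correct split found}] \;\le\; \frac{4cn\,(2n+1)}{4^{n}} \to 0
   \quad \text{as } n \to \infty .
\end{align*}
```

For example, with $n = 10$ and $c = 1$ there are $M = 92378$ equal-sized partitions and only $m = 20$ draws, so the probability of finding the single correct split in that iteration is at most $20/92378 \approx 2.2 \times 10^{-4}$. The lower bound on the central binomial coefficient holds because it is the largest of the $2n+1$ coefficients that sum to $4^{n}$.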

To further refine our analysis of the combinatorics of category formation,
it is necessary to make some stronger assumptions about the objects being
categorized. Let us consider the case of a modal world. Here, the object
classes and features are such that every feature takes on a different value for
each different class. Let F be the number of features in the current representation.
We continue to assume that there are only two natural classes
present; thus each feature takes on only two values over the entire population.
The uncertainty measure U has been constructed such that in a modal
world, the best possible categorization is that which separates the modal
classes. The question we wish to consider is what is the probability that
the categorization procedure presented above will indeed separate the two
classes present.

Let us define a k-overlap partition of a two-class category as a partition
where k objects of each class are in the wrong category. That is, both
categories of the partition contain n - k objects of one class (called the
primary class) and k of the other. Thus, k < n/2. Because each category
contains k objects that are members of a class of which there are a total of
n objects, there are $\binom{n}{k}^2$ k-overlap partitions of a category of 2n
objects.12
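
As a quick check of this count, the following Python sketch tallies k-overlap partitions and confirms that, taken over all admissible k, they account for every partition of 2n objects into two equal-sized categories. It is an illustration only, not part of the implementation described in this thesis; the function names are hypothetical, and it assumes that the two categories are unordered, so that the tie case k = n/2 (possible only for even n) is counted once per pair.

```python
from math import comb


def k_overlap_count(n: int, k: int) -> int:
    """Number of k-overlap partitions of 2n objects (n objects per class).

    We choose which k objects of each class sit in the 'wrong' category.
    """
    return comb(n, k) ** 2


def equal_partitions(n: int) -> int:
    """Number of ways to split 2n objects into two unordered halves of size n."""
    return comb(2 * n, n) // 2


def count_by_overlap(n: int) -> int:
    """Sum the k-overlap counts for k < n/2, plus the tie case k = n/2."""
    total = sum(k_overlap_count(n, k) for k in range((n + 1) // 2))
    if n % 2 == 0:
        # At k = n/2 both labelings describe the same unordered partition.
        total += k_overlap_count(n, n // 2) // 2
    return total
```

For n = 8, for instance, `k_overlap_count(8, 2)` counts the 2-overlap partitions among the `equal_partitions(8)` possible equal-sized splits of 16 objects.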

The importance of k-overlap partitions is their role in the following
proposition, whose proof we postpone until we derive an analytic expression for
the total uncertainty U of a k-overlap partition:

Proposition  5.1  In  a  modal  world,  the  categories  of  a  k-overlap  partition 
are  captivating  for  their  primary  classes  if  k  <  (n/2)  and  if  the  uncertainty 
measure  U  is  used  as  the  hypothesis  evaluation  function. 

This proposition implies that the creation of a k-overlap partition of a modal
world is sufficient to guarantee that the modal classes will be recovered by
the categorization procedure: once the categories become captivating, all new
objects are categorized correctly and previous mistakes are corrected when
the incorrectly categorized objects are viewed again. Thus the probability of
correctly categorizing the modal world is equivalent to the probability that
a k-overlap partition is created by the splitting step of the categorization
procedure. To determine this probability we first need to derive an expression
for the total uncertainty of a k-overlap partition relative to the total
uncertainty of the single category categorization.

The total uncertainty for the single category categorization is simple to
derive. Because there is only one category, there is no category uncertainty:
Uc = 0. Because each of the F features takes on only two values, and
because they are evenly distributed, the property uncertainty Up is equal
to $F \cdot (-\tfrac{1}{2}\log\tfrac{1}{2} - \tfrac{1}{2}\log\tfrac{1}{2}) = F$,
taking logarithms to base 2. Thus the total uncertainty U of the single
category categorization is (1 - λ)F.

The expressions for the k-overlap partition of the modal, two-class world
are more complicated. First, because Uc is non-zero when k ≠ 0, we need
to compute the normalization factor η, the ratio between the property
uncertainty of the coarsest categorization (a single category) and the
category uncertainty of the finest categorization. We have already shown that
Up(Coarsest(Z)) is equal to F. It is easy to show that Uc(Finest(Z)) =
log n: if each object is its own category, then in this two-class, modal world
there are two sets of n identical categories. Because objects of modal classes
have identical properties, all subsets of features are equally diagnostic; each
subset is consistent with membership in n categories.13 Thus the category
uncertainty is log n for each subset of features, yielding an average category
uncertainty of the same value: Uc = log n. Hence η = F / log n.

To develop the expressions for Up and Uc for a k-overlap partition, we will
explicitly derive them for the case k = 1. The case k > 1 follows analogously.
First, consider the property uncertainty of the 1-overlap partition. For each
feature, there are two values present in each category; one value (that of the
primary class objects) occurs n - 1 times, the other value, only once. Thus
the property uncertainty (for each category, and therefore for their average)
is given by:

$$U_p^1 = F \cdot \left[ -\frac{1}{n}\log\left(\frac{1}{n}\right) - \frac{n-1}{n}\log\left(\frac{n-1}{n}\right) \right] \qquad (5.2)$$

or

$$U_p^1 = F \cdot \left[ \frac{1}{n}\log\left(\frac{n}{1}\right) + \frac{n-1}{n}\log\left(\frac{n}{n-1}\right) \right] \qquad (5.3)$$

13 For the analysis presented here, we assume that the null set is not an
allowed subset of features. Otherwise, for that one subset, the number of
categories consistent with the (null) property description would be 2n.


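The derivation of Uc is similar. Because we are considering a 1-overlap
partition of a modal world, any subset of the properties of an object (as before
we exclude the null set) is consistent with n - 1 objects in one category (the
correct category) and 1 object in the other. Thus the category uncertainty
remains constant for all objects and for all subsets. Its value (and also the
value of Uc for the 1-overlap partition, because Uc is simply the average of
this constant value) is given by:

$$U_c^1 = \frac{1}{n}\log\left(\frac{n}{1}\right) + \frac{n-1}{n}\log\left(\frac{n}{n-1}\right) \qquad (5.4)$$

The argument is the same for the general k-overlap partition, where k < n/2.
The complete expressions are:

$$U_p^k = F \cdot \left[ \frac{k}{n}\log\left(\frac{n}{k}\right) + \frac{n-k}{n}\log\left(\frac{n}{n-k}\right) \right] \qquad (5.5)$$

$$U_c^k = \frac{k}{n}\log\left(\frac{n}{k}\right) + \frac{n-k}{n}\log\left(\frac{n}{n-k}\right) \qquad (5.6)$$
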

Two properties of the above expressions are worth noting. First, Up = Uc = 0
when k = 0.14 This situation is equivalent to an exact partition of the
single category into the two modal classes. By design, the total uncertainty
of a categorization that separates modal classes is zero.

14 More precisely, $\lim_{k \to 0} \frac{k}{n}\log\left(\frac{n}{k}\right) = 0$.

Second, by taking the derivative of the above expressions with respect
to k, one can show that both of the above quantities achieve a maximum
at k = n/2; also, the total uncertainty increases monotonically as k
increases from zero to n/2. Thus, unsurprisingly, the total uncertainty of a
k-overlap partition increases as the class overlap of the categories increases
and is maximized when exactly half the objects of each class are contained
in each of the two categories. However, stated in a different form, this result
becomes important. Suppose we have a k-overlap partition where k < n/2;
the inequality assures that each category has a primary class to which it
corresponds. Next, suppose we view a new object. The fact that the total
uncertainty increases as the degree of overlap increases implies that if we
add that object to the incorrect category (the category whose primary
class does not correspond to the class from which the new object was drawn)
then we will increase the total uncertainty U. Similarly, adding the new
object to the correct category will reduce total uncertainty. Therefore, to
minimize total uncertainty we will always add new objects to the category
corresponding to their primary class. As such, we have now proven
Proposition 5.1: the categories of a k-overlap partition of a modal world form
a set of captivating categories if k < n/2 and if the total uncertainty
measure U is used as the hypothesis evaluation function.
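
The monotonic behavior on which the proof rests can be checked numerically. The sketch below is an illustration under stated assumptions rather than the thesis implementation: it uses base-2 logarithms, takes the category uncertainty of a k-overlap partition to have the same entropy form as the property uncertainty without the factor F, and uses η = F / log n as the normalization. All function and parameter names are hypothetical.

```python
from math import log2


def overlap_entropy(n: int, k: int) -> float:
    """(k/n) log(n/k) + ((n-k)/n) log(n/(n-k)), taking 0 * log(1/0) as 0."""
    total = 0.0
    for count in (k, n - k):
        if count > 0:
            total += (count / n) * log2(n / count)
    return total


def total_uncertainty(n: int, k: int, num_features: int, lam: float) -> float:
    """U = (1 - lam) * Up + lam * eta * Uc for a k-overlap partition."""
    h = overlap_entropy(n, k)
    up = num_features * h             # property uncertainty of the split
    uc = h                            # category uncertainty of the split
    eta = num_features / log2(n)      # Up(coarsest) / Uc(finest), assumed form
    return (1 - lam) * up + lam * eta * uc


# Total uncertainty for k = 0 .. n/2 with n = 8 objects per class.
values = [total_uncertainty(n=8, k=k, num_features=5, lam=0.5) for k in range(5)]
```

In this sketch the list `values` starts at zero for the exact partition and grows with each additional misplaced object, which is the property used to argue that new objects are always added to the category of their primary class.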

Using the above results we can now determine the probability that a
k-overlap partition will be formed when categorizing a modal world. Let us
compute the ratio between the total uncertainty of the split (the k-overlap
partition) and single category categorizations. Combining the above results and
including the necessary normalization term yields the following expression
for this ratio (referred to by ρ):

$$\rho = \left[ 1 + \left(\frac{\lambda}{1-\lambda}\right)\frac{1}{\log n} \right] \left[ \log n - \frac{k}{n}\log k - \frac{n-k}{n}\log(n-k) \right] \qquad (5.7)$$


ρ represents the decision function used in the category splitting step (6) of
the categorization algorithm for the restricted case of two modal classes. If ρ
is less than 1.0 then the uncertainty of the split categorization is less than
that of the single category categorization, and the split is accepted. Notice
that ρ increases with λ. That is, it is more difficult to split a category when
the value of λ is high. This behavior is to be expected. Higher values of λ
cause the uncertainty measure U to weight the category uncertainty Uc more
heavily than the property uncertainty Up; coarse categorizations with few
categories are preferred over finer categorizations. Thus a higher value of λ
makes it more difficult to accept a split categorization over a single category.
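
The decision function can be written down directly. The following Python sketch is an illustration, not the thesis code: it assumes base-2 logarithms, 0 ≤ k ≤ n/2, n ≥ 2 and 0 ≤ λ < 1, and the names `rho` and `accept_split` are hypothetical.

```python
from math import log2


def rho(n: int, k: int, lam: float) -> float:
    """Ratio of the split's total uncertainty to that of the single category.

    n   -- number of objects in each of the two modal classes
    k   -- number of objects of each class in the wrong category
    lam -- trade-off parameter lambda, with 0 <= lam < 1
    """
    entropy = log2(n)
    for count in (k, n - k):
        if count > 0:                      # 0 * log 0 is taken as 0
            entropy -= (count / n) * log2(count)
    return (1 + (lam / (1 - lam)) / log2(n)) * entropy


def accept_split(n: int, k: int, lam: float) -> bool:
    """A split is preferred when its uncertainty ratio falls below 1."""
    return rho(n, k, lam) < 1.0
```

With n = 8, for example, calling `accept_split` for increasing k shows where the split stops being preferred, and raising `lam` for a fixed k raises the ratio.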

Using equation 5.7 we can compute the maximum k such that a k-overlap
partition has lower total uncertainty U than the single category
categorization. Table 5.4 lists the maximum k for different values of n and λ;
the fractional number below each entry is the proportion of equal-sized
partitions of a set of 2n objects that are k-overlap partitions for k less than
or equal to the maximum. For example, when there are 16 total objects (n = 8)
and λ = .4, the maximum acceptable value of k is 2. Thus, k-overlap partitions
with k ∈ {0, 1, 2} have a lower uncertainty than the single category
categorization. As indicated, this set of partitions constitutes 13% of all
partitions of 2n objects into equal-sized categories. Notice that for n > 15
the percentage of acceptable partitions is almost zero for all but the lowest
values of λ. Thus, we begin to see that category formation must occur before
the initial category size (2n) becomes greater than 20 or so.

Table 5.4: Maximum k such that the k-overlap partition has a lower uncertainty U
than the single category categorization. The fractional number below represents
the proportion of equal-sized partitions of the 2n objects that are k-overlap
partitions with k less than or equal to the maximum value.
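
The two quantities tabulated for each n and λ can be recomputed from equation 5.7. The sketch below is a hedged illustration (hypothetical names, base-2 logarithms assumed, unordered equal-sized partitions assumed); it is not the program used to produce the table.

```python
from math import comb, log2


def rho(n: int, k: int, lam: float) -> float:
    """Uncertainty ratio of a k-overlap split to the single category (eq. 5.7)."""
    entropy = log2(n) - sum((c / n) * log2(c) for c in (k, n - k) if c > 0)
    return (1 + (lam / (1 - lam)) / log2(n)) * entropy


def max_acceptable_k(n: int, lam: float) -> int:
    """Largest k < n/2 whose k-overlap partition beats the single category."""
    best = -1
    for k in range((n + 1) // 2):
        if rho(n, k, lam) < 1.0:
            best = k
    return best


def acceptable_fraction(n: int, lam: float) -> float:
    """Proportion of equal-sized partitions of 2n objects with k <= max k."""
    k_max = max_acceptable_k(n, lam)
    good = sum(comb(n, k) ** 2 for k in range(k_max + 1))
    return good / (comb(2 * n, n) // 2)
```

For the example in the text, `max_acceptable_k(8, 0.4)` and `acceptable_fraction(8, 0.4)` give the maximum overlap and the proportion of acceptable partitions for 16 objects.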

Using the maximum k values, we can compute the probability of success
of the splitting step (6) of the categorization procedure. This probability
depends upon how many partitions of the population are evaluated in each
iteration of the algorithm; we let μ represent the number of attempts. We
compute the incremental probability of success (incremental because it refers
to only one iteration of the procedure) by computing the probability that
none of the attempted partitions is a k-overlap partition of a sufficiently
low k. Assuming an independent sampling of partitions, the probability of
failure is simply the proportion of partitions that do not satisfy the maximum
k-overlap condition raised to the power μ. The probability of success is then
one minus this failure rate. Table 5.5 lists values of the incremental
probability of a successful split as a function of n, λ and μ. For example, if
there are n = 8 objects in each class, if λ has been set to .5, and if μ = 10
partitions are attempted, then the incremental probability of success is
0.0965. It is important to note that because the probability of failure is the
repeated product of a number less than one (the proportion of partitions that
overlap too many objects of the modal classes), the incremental probability of
success can be made arbitrarily close to unity by increasing μ.

Hypotheses per iteration: 5
[The rows for lambda = 0.10 through 0.80 of this block are missing.]

     n:        4        5        6        8       10       15       20
lambda:
  0.90   0.13492  0.03906  0.01078  0.00078  0.00005  0.00000  0.00000

Hypotheses per iteration: 10

     n:        4        5        6        8       10       15       20
lambda:
  0.10   0.99871  0.90085  0.99977  0.99994  0.86069  0.99812  0.98503
  0.20   0.99871  0.90085  0.56602  0.75705  0.86069  0.78657  0.69797
  0.30   0.99871  0.90085  0.56602  0.75705  0.86069  0.23818  0.22879
  0.40   0.25164  0.90085  0.56602  0.75705  0.20771  0.23818  0.03782
  0.50   0.25164  0.07659  0.56602  0.09654  0.20771  0.02779  0.03782
  0.60   0.25164  0.07659  0.02144  0.09654  0.01088  0.02779  0.00359
  0.70   0.25164  0.07659  0.02144  0.09654  0.01088  0.00145  0.00019
  0.80   0.25164  0.07659  0.02144  0.00155  0.00011  0.00003  0.00001
  0.90   0.25164  0.07659  0.02144  0.00155  0.00011  0.00000  0.00000

Hypotheses per iteration: 20

     n:        4        5        6        8       10       15       20
lambda:
  0.10   1.00000  0.99017  1.00000  1.00000  0.98059  1.00000  0.99978
  0.20   1.00000  0.99017  0.81166  0.94097  0.98059  0.95445  0.90879
  0.30   1.00000  0.99017  0.81166  0.94097  0.98059  0.41963  0.40523
  0.40   0.43996  0.99017  0.81166  0.94097  0.37228  0.41963  0.07420
  0.50   0.43996  0.14731  0.81166  0.18376  0.37228  0.05481  0.07420
  0.60   0.43996  0.14731  0.04241  0.18376  0.02164  0.05481  0.00717
  0.70   0.43996  0.14731  0.04241  0.18376  0.02164  0.00290  0.00039
  0.80   0.43996  0.14731  0.04241  0.00310  0.00022  0.00006  0.00001
  0.90   0.43996  0.14731  0.04241  0.00310  0.00022  0.00000  0.00000

Table 5.5: Probability of a successful split if 5, 10, or 20 attempts are
considered. This probability is for a single iteration.
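
The incremental probability is straightforward to compute once the acceptable proportion is known. The sketch below is an illustration only; it assumes base-2 logarithms, independent sampling of partitions as stated above, and unordered equal-sized partitions, and its function names are hypothetical.

```python
from math import comb, log2


def rho(n: int, k: int, lam: float) -> float:
    """Uncertainty ratio of a k-overlap split to the single category (eq. 5.7)."""
    entropy = log2(n) - sum((c / n) * log2(c) for c in (k, n - k) if c > 0)
    return (1 + (lam / (1 - lam)) / log2(n)) * entropy


def acceptable_fraction(n: int, lam: float) -> float:
    """Proportion of equal-sized partitions whose overlap k gives rho < 1."""
    good = sum(comb(n, k) ** 2
               for k in range((n + 1) // 2) if rho(n, k, lam) < 1.0)
    return good / (comb(2 * n, n) // 2)


def incremental_success(n: int, lam: float, mu: int) -> float:
    """Probability that at least one of mu sampled partitions is acceptable."""
    p_fail_once = 1.0 - acceptable_fraction(n, lam)
    return 1.0 - p_fail_once ** mu
```

Calling `incremental_success(8, 0.5, 10)` corresponds to the worked example in the text; increasing `mu` for fixed `n` and `lam` drives the result toward one.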

To determine whether a single category categorization will at any point
be split into a k-overlap partition of two categories, we compute the
probability that every iteration through the algorithm fails to split the
single category. Let us define $p_f(n, \lambda, \mu)$ to be the probability of
failing to split a category in a given iteration; $p_f$ is equal to one minus
the probability of success. Then, assuming independence between iterations, the
probability of never succeeding in splitting the single category categorization
is simply the product of the probabilities of failing at each step. Because the
number of objects increases by one with each iteration of the categorization
procedure, but our equations are only defined for an even total number of
objects (2n), we approximate this product as follows:

$$\text{Prob. of success}\,\big|_{n_0,\lambda,\mu} = 1 - \prod_{n=n_0}^{\infty} \left[ p_f(n,\lambda,\mu) \cdot p_f(n+1,\lambda,\mu) \right] \qquad (5.8)$$

where n0 is the initial n, equal to half the size of the initial category. It can
be shown that this approximation is conservative in that it under-estimates
the probability of success. Thus, we can finally compute the probability
that the incremental hypothesis generation method will correctly categorize
a modal world of two classes.
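
Equation 5.8 can be evaluated numerically by truncating the infinite product. The sketch below is a hedged illustration, not the original program: the cut-off `n_max` is an assumption of this sketch (the equation itself runs to infinity), base-2 logarithms and independent sampling are assumed, and all names are hypothetical.

```python
from math import comb, log2


def rho(n: int, k: int, lam: float) -> float:
    """Uncertainty ratio of a k-overlap split to the single category (eq. 5.7)."""
    entropy = log2(n) - sum((c / n) * log2(c) for c in (k, n - k) if c > 0)
    return (1 + (lam / (1 - lam)) / log2(n)) * entropy


def p_fail(n: int, lam: float, mu: int) -> float:
    """Probability that none of mu sampled partitions is acceptable."""
    good = sum(comb(n, k) ** 2
               for k in range((n + 1) // 2) if rho(n, k, lam) < 1.0)
    return (1.0 - good / (comb(2 * n, n) // 2)) ** mu


def success_probability(n0: int, lam: float, mu: int, n_max: int = 20) -> float:
    """1 - prod over n = n0..n_max of p_f(n) * p_f(n + 1), per equation 5.8.

    n_max truncates the infinite product; it is a choice made for this sketch.
    """
    fail = 1.0
    for n in range(n0, n_max + 1):
        fail *= p_fail(n, lam, mu) * p_fail(n + 1, lam, mu)
    return 1.0 - fail
```

Raising `n_max` can only increase the computed success probability, since every additional factor in the product is at most one.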

Equation 5.8 allows us to compute the probability of a successful
categorization for given starting n0, λ, and μ; the results are displayed in
Table 5.6. Several observations should be made about these results. First,
when the starting n0 is small, and when μ equals 10 or 20, the probability of
success is quite high for most λ. (In the current implementation, n0 = 4 and
μ averages about 10.) Second, the success probability increases with μ,
confirming our earlier statement that the probability of success can be raised
by increasing μ. Finally, as n0 becomes large, the probability of success drops
rapidly. For example, for λ = .6 and μ = 10, the probability of success drops
from 32% to 7% as n0 increases from 10 to 15.

Success Probability Table

Hypotheses per iteration: 5

Start n:       4        5        6        8       10       15       20
lambda:
  0.10   1.00000  1.00000  1.00000  1.00000  1.00000  1.00000  0.96393
  0.20   1.00000  1.00000  1.00000  1.00000  1.00000  0.9984?  0.60276
  0.30   1.00000  1.00000  1.00000  0.99986  0.99859  0.90314  0.17584
  0.40   0.99937  0.99768  0.98883  0.95377  0.81115  0.49557  0.07944
  0.50   0.89540  0.87417  0.80123  0.57333  0.47836  0.14445  0.02729
  0.60   0.58400  0.49957  0.47356  0.24747  0.17659  0.03489  0.00250
  0.70   0.33660  0.20197  0.16048  0.10193  0.01733  0.00220  0.00013
  0.80   0.22838  0.07178  0.02353  0.00635  0.00511  0.00010  0.00000
  0.90   0.22449  0.06709  0.01860  0.00134  0.00009  0.00000  0.00000

Hypotheses per iteration: 10

Start n:       4        5        6        8       10       15       20
lambda:
  0.10   1.00000  1.00000  1.00000  1.00000  1.00000  1.00000  0.99870
  0.20   1.00000  1.00000  1.00000  1.00000  1.00000  1.00000  0.84220
  0.30   1.00000  1.00000  1.00000  1.00000  1.00000  0.99062  0.32077
  0.40   1.00000  0.99999  0.99988  0.99786  0.96434  0.74555  0.15257
  0.50   0.98906  0.98417  0.96049  0.81795  0.72789  0.26803  0.05384
  0.60   0.82695  0.74957  0.72286  0.43371  0.32200  0.06856  0.00499
  0.70   0.55990  0.36314  0.29521  0.19346  0.03436  0.00439  0.00026
  0.80   0.40460  0.13840  0.04650  0.01266  0.01020  0.00020  0.00001
  0.90   0.39858  0.12969  0.03686  0.00267  0.00019  0.00000  0.00000

Hypotheses per iteration: 20

Start n:       4        5        6        8       10       15       20
lambda:
  0.10   1.00000  1.00000  1.00000  1.00000  1.00000  1.00000  1.00000
  0.20   1.00000  1.00000  1.00000  1.00000  1.00000  1.00000  0.97510
  0.30   1.00000  1.00000  1.00000  1.00000  1.00000  0.99991  0.53864
  0.40   1.00000  1.00000  1.00000  1.00000  0.99873  0.93526  0.28186
  0.50   0.99988  0.99975  0.99844  0.96686  0.92596  0.46423  0.10479
  0.60   0.97005  0.93729  0.92320  0.67931  0.54031  0.13242  0.00995
  0.70   0.80632  0.59441  0.50327  0.34950  0.06754  0.00877  0.00052
  0.80   0.64550  0.25765  0.09084  0.02515  0.02030  0.00040  0.00001
  0.90   0.63829  0.24256  0.07235  0.00533  0.00037  0.00000  0.00000

Table 5.6: Probability of successful categorization of the two-class modal
world, for different starting values of n and different values of λ and μ.

To conclude our analysis, we consider two cases that deviate slightly from
the previous conditions. First is the case of more than two modal classes.
Suppose there are 3 objects of each of four different classes (A, B, C, D) for
a total of 12 objects. In the previous analysis, this situation corresponded
to the case n = 6. Referring to Table 5.4, we note that for λ = .6 only the
0-overlap partition is acceptable; there is only 1 such partition, for a
probability of .002. In the new case of 4 classes, however, there are 3
possible partitions ({AB, CD}, {AC, BD}, {AD, BC}) that cause no overlap
between classes. Each of these partitions yields a reduction in the property
uncertainty with no corresponding increase in category uncertainty; each has a
lower total uncertainty than the single category categorization. Thus, with
more classes present the task of initially forming categories is easier.
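
The counts in this comparison can be confirmed by brute-force enumeration. The sketch below is illustrative only (hypothetical names): it lists every unordered split of the objects into two equal-sized categories and counts those in which no class is divided between the categories.

```python
from itertools import combinations


def zero_overlap_partitions(labels):
    """Return (clean, total) over all unordered equal-sized two-way splits.

    labels -- class label of each object; len(labels) must be even.
    A split is 'clean' when no class has members in both categories.
    """
    n_objects = len(labels)
    half = n_objects // 2
    clean = 0
    total = 0
    # Object 0 is pinned to the first category so that each unordered
    # partition is generated exactly once.
    for rest in combinations(range(1, n_objects), half - 1):
        first = {0, *rest}
        first_classes = {labels[i] for i in first}
        second_classes = {labels[i] for i in range(n_objects) if i not in first}
        total += 1
        if first_classes.isdisjoint(second_classes):
            clean += 1
    return clean, total


two_class_counts = zero_overlap_partitions(list("AAAAAABBBBBB"))
four_class_counts = zero_overlap_partitions(list("AAABBBCCCDDD"))
```

Both populations contain 12 objects, so the total number of candidate partitions is the same; only the number of clean partitions differs.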

The second variation is the case where the single category contains an
unequal number of objects from the two classes; say n of one class of objects
and m of another, where n > m. In this case the question arises of whether
the observer attempts to form unequal-sized partitions. Although doing
so would permit him to possibly recover the exact partition, the increase in
the possible number of partitions (the number of partitions is now on
the order of $2^{n+m}$ as opposed to the previous case of 2n) makes such a
strategy unlikely to succeed. If, however, the observer only attempts
equal-sized partitions, then even the best possible partition will result in
n - m objects being in the wrong category. Thus, a smaller percentage of the
partitions will be preferred over the single category than in the case where
there is an equal number of objects from each class. Recovering categories
corresponding to the natural classes is more difficult when the objects are
unevenly distributed in the initial categorization.

To summarize, we have determined the theoretical competence of the
categorization procedure for the ideal case of a two-class, modal world. In
particular, we have shown that the probability of successful categorization can
be made arbitrarily high by increasing the number of partitions considered at
each iteration (μ). Also, for values of μ used in the current implementation
and for a wide range of λ, the probability of success has been shown to be
quite high. Finally, we have argued that if more than two classes of objects
are present, the formation of categories is easier (for the same number of
total objects) because a larger percentage of the possible partitions satisfy
the k-overlap conditions.

5.4  Improving performance: Internal re-categorization

In the previous section we demonstrated that given a modal world, the
probability of a successful categorization is high. However, often the
probability is less than one. Also, real data is not purely modal, making
category formation more difficult; noise features mask category structure.
Thus we can expect errors of the form shown in Figure 5.2: two classes grouped
together in one category.

The results of the previous section show that we can reduce the probability
of this type of error by increasing μ, the number of category splits
attempted during each iteration of the categorization procedure. However,
the evaluation of a partition is an expensive computation.15 Also, as n
becomes large, the proportion of acceptable partitions is so small that we
would require μ to be huge before a reasonable chance of success was attained.
This sparse distribution of helpful partitions in the combinatoric space of
possible partitions resulted in the poor performance of the random partitioning
algorithm of Fortier and Solomon [1966]. Thus we would prefer a better
solution.

15 In the current implementation the decision as to whether to split a
category is made locally. The program considers each category as its own
population, and evaluates the proposed split relative to the single category
categorization. The normalization factor used, however, is the one based upon
the total population. Otherwise, it would have the effect of scaling λ. This
local decision can be shown to be approximately equivalent to deciding each
split by considering the entire categorization of objects.

One such possibility is simply re-categorizing each category in an
incremental fashion. Consider again the last categorization shown in Figure 5.3.
Let us assume that no split of any of the single class categories would be
accepted, as is always true in the modal case. If we were to re-categorize each
category independently, forming a new observation sequence for each, then
there is some (empirically shown to be high) probability that the poplar and
the birch leaves would be properly separated. By repeated application of
this procedure, the probability of a correct classification can again be made
arbitrarily high.
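
The control flow of repeated internal re-categorization can be sketched as follows. This is a schematic illustration under stated assumptions, not the implementation used in this work: the `categorize` argument stands in for the categorization procedure of this chapter and is supplied by the caller, and `split_by_species` is a hypothetical toy stand-in that simply groups labeled objects so that the loop can be exercised.

```python
from typing import Callable, List, Sequence

Category = List[object]


def recategorize(categories: Sequence[Category],
                 categorize: Callable[[Category], List[Category]],
                 max_rounds: int = 10) -> List[Category]:
    """Re-run a categorization procedure inside each category until stable.

    categorize -- maps one category to a list of one or more sub-categories;
                  returning a single category means no split was accepted.
    """
    current = [list(c) for c in categories]
    for _ in range(max_rounds):
        refined: List[Category] = []
        changed = False
        for category in current:
            parts = categorize(category)
            if len(parts) > 1:
                changed = True
            refined.extend(parts)
        current = refined
        if not changed:       # no category was split in this round
            break
    return current


def split_by_species(category: Category) -> List[Category]:
    """Toy stand-in for the real procedure: group (species, id) pairs."""
    groups = {}
    for leaf in category:
        groups.setdefault(leaf[0], []).append(leaf)
    return list(groups.values())


mixed = [[("poplar", 1), ("birch", 1), ("poplar", 2)], [("oak", 1)]]
result = recategorize(mixed, split_by_species)
```

Passing the procedure as an argument keeps the loop independent of how a split is proposed and evaluated, which is where the substance of the method lies.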

Figures 5.6 and 5.7 illustrate an example of the implementation of such
a re-categorization procedure. Figure 5.6 displays the result of executing
the categorization procedure on a population of 80 leaves, 16 each of the
species oak, maple, elm, birch, and poplar. The last categorization listed
shows the resultant categorization formed after sequentially viewing the
entire population.16 In this example, the birch and the elm leaves have not
been separated into distinct categories. However, we can re-execute the
categorization procedure using the combined category to form a new observation
sequence; when we do so, the categorization procedure correctly separates
the classes (Figure 5.7).17 Though not shown, we should mention that attempts
to internally re-categorize the other categories do not yield any new
divisions. The species categories approximate modal classes; any partition
of the species produces a categorization with greater total uncertainty.

16 Notice that several oak leaves are missing. To make the implementation of
the categorization procedure more robust, categories that become too small are
deleted. Because every object is guaranteed to be repeated in the observation
sequence, these deleted objects will be categorized again. Also, one birch leaf
is contained in the poplar category. This particular object was mistakenly
categorized early in the observation sequence. When it is viewed again (the
infinite observation sequence guarantees that it will be seen repeatedly) the
mistake would be corrected.

17 The normalization coefficient is not re-computed when re-categorizing the
single category. Its purpose is to normalize Up and Uc with respect to the
entire population.

Figure 5.6: Another example of the final categorization yielding a category
containing two classes.

Figure 5.7: Re-categorizing the single category containing the birch and elm
leaves of Figure 5.6.

At this point we conclude our discussion of methods of improving the
performance of the categorization procedure. There are two reasons not to
continue exploring methods of improving the statistical performance of the
algorithm. First, the efficiency issues involved are not directly related to the
question of object categorization, but are more questions of statistical
sampling; for example, the current implementation was improved by cycling
through the objects sequentially, instead of generating an observation sequence
by randomly sampling the population. Second, and more importantly, improving
the behavior of the categorization procedure by increasing its efficiency
does not address the fundamental question underlying object categorization:
what information can be provided to the observer to facilitate the recovery of
natural object categories? In the conclusion of this thesis, when we consider
potential extensions to this work, we will return to the issue of recovering
natural categories.


Chapter  6 


Multiple  Modes 


The Principle of Natural Modes states that objects in the natural world
cluster in dimensions important to the interaction between objects and their
environment. However, clustering may occur at many levels: mammals and
birds represent one natural grouping; cats and dogs another. This hierarchy
of clusters (multiple modal levels) occurs because of hierarchical processes
involved in the formation of objects; a "dog" may be described as a
composite of the processes that create mammals and those that distinguish
a dog from other mammals. Each process imposes a regularity on the objects,
making inferences about the properties of these objects possible. For
example, all mammals are warm-blooded and have hair.

We have claimed that the goal of the observer is to recover categories of
objects corresponding to natural clusters. But this task is complicated by the
presence of multiple modal levels. The properties constrained by one process
may be independent of those constrained by another. Thus, if the different
properties of objects encoded by the observer are constrained by different
processes, then the category structure reflected in one set of properties is
masked by other sets. The purpose of this chapter is to explore these issues.

We present a solution of the multiple mode problem that first entails
identifying when multiple modes are present, and then incrementally segregating
the population according to processes. We will continue to assume that
objects are represented by property vectors and that the total categorization
uncertainty U is used to measure the degree to which a categorization reflects
the natural modes. U is defined as follows:

$$U(Z) = (1 - \lambda)\, U_p(Z) + \lambda\, \eta(Z)\, U_c(Z) \qquad (6.1)$$

where Up is the uncertainty about the properties of an object once its
category is known, Uc is the average uncertainty of the category to which an
object belongs, η is a normalization coefficient between Up and Uc, and λ is a
free parameter representing the desired trade-off between the two
uncertainties. (See chapter 4 for complete definitions of these terms.) In this
chapter we will consider the interaction between λ and the categories recovered
by the observer. Also, the behavior of the incremental categorization algorithm
presented in chapter 5 will be used to validate the theory developed here.

Our  first  step  is  to  identify  a  null  hypothesis,  a  case  in  which  no  modal 
structure  is  present.  Being  able  to  detect  an  unstructured  situation  will 
permit  us  to  develop  a  strategy  that  relies  on  searching  for  sub-processes 
until  no  further  structure  can  be  found.  Next,  using  both  a  simulation  and 
a  real  example  of  a  multiple  mode  population,  we  will  examine  the  results 
of  attempting  to  categorize  such  a  set  of  objects;  these  results  will  suggest  a 
method  for  separating  modes  according  to  processes.  Finally,  we  will  develop 
a  method  of  evaluating  the  contribution  of  a  feature  to  the  recovery  of  a 
particular  modal  level. 

6.1  A non-modal world

Natural  modes  are  an  appropriate  basis  for  categorization  because  they  rep¬ 
resent  classes  of  objects  which  are  redundant  in  important  properties.  That 
is,  from  the  knowledge  of  some  properties  of  an  object,  the  observer  can  infer 
a  natural  mode  category  that  permits  him  to  make  inferences  about  other 
object  properties.  In  an  ideally  modal  world,  the  properties  of  interest  to  the 
observer  —  those  he  encodes  about  an  object  —  are  completely  predictive: 
knowledge  of  one  property  permits  predictions  about  all  others. 

In  this  section,  however,  we  wish  to  define  an  unstructured  world,  a  world 
with  no  natural  modes.  In  such  a  world,  the  properties  of  objects  are  indepen¬ 
dent.  Knowledge  of  some  properties  about  an  object  provides  no  information 
about  any  of  the  other  properties.1  We  refer  to  this  world  as  a  non-modal 
world. Our goal is to identify such non-modal worlds and to understand the
behavior of the categorization algorithm and of the total uncertainty measure
U when operating in such a world.

6.1.1 Categorizing a non-modal world: an example

We begin with a simulation of a non-modal world. We construct a population of
64 objects, described by six independent, uniformly distributed, binary
features. In this world an object is referred to as NULL-8, NULL-24, etc.;
the property vector attached to each object is generated randomly from the
2^6 = 64 possibilities. These objects and features satisfy the non-modal
condition in that knowledge about some of the properties of an object provides
no information about any other properties. Next, we categorize these objects
using an incremental categorization procedure that implements the total
uncertainty measure U as a categorization evaluation function. (For a detailed
discussion of the operation of the categorization algorithm see chapter 5.)
The dynamic output of the categorization system is displayed in Figures 6.1
and 6.2.
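
As a rough illustration, a population of this kind could be generated as
follows. This is only a sketch: the function name, the use of Python's
`random` module, and the fixed seed are assumptions made for the example and
are not details of the original simulation.

```python
import random

def make_non_modal_population(n_objects=64, n_features=6, seed=0):
    """Return {name: property vector} with independent, uniform binary features."""
    rng = random.Random(seed)  # seeded only so that the sketch is repeatable
    population = {}
    for i in range(1, n_objects + 1):
        # Each feature is drawn independently, so no feature predicts another.
        vector = tuple(rng.randint(0, 1) for _ in range(n_features))
        population[f"NULL-{i}"] = vector
    return population

population = make_non_modal_population()
```

Because vectors are drawn with replacement from the 64 possibilities, two
objects may share a property vector while other vectors never occur.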

Figure 6.1 presents the results of executing the categorization algorithm
with a value of λ of .55. Notice that categories continue to split into
smaller categories as new objects are added; the last categorization shown
contains only categories still too small to be subdivided by the
categorization algorithm. In the limit, the finest categorization, in which
each object is its own category, would be selected. Because reducing the
value of λ causes the categorization algorithm to produce only finer
categorizations, we know that for all λ < .55, the finest categorization will
be recovered. Figure 6.2 displays the results of running the categorization
algorithm on the same population but with a λ of .6. Now, the stable
categorization is one in which all objects are in a single category, the
coarsest categorization possible. Reasoning as before, we know that for
λ > .6 only the coarsest categorization will be recovered.2 Thus, for this
simulation of a non-modal world, no structured categorizations are recovered.

square of the length. A more subtle case is when one property is the area
covered by an object, and another is the perimeter. (The perimeter must be
greater than or equal to 2√(π · area).) An interesting question is how the
observer determines whether redundancy is caused by natural modes or logical
dependence. A simple, though unexplored, solution relies on the fact that
logical redundancies must be true for all objects, whereas modal dependencies
hold only within the particular mode.

2 For .55 < λ < .6 an intermediate categorization may be recovered, but it is
unstable in

Figure 6.1: The output of the categorization system when categorizing an
independent world. For this execution the value of λ is .55. The categories
generated continually split once they are large enough to be subdivided by the
categorization algorithm. In the limit the preferred categorization consists
of each object being its own category.

The  fact  that  the  categorization  algorithm  found  no  intermediate  struc¬ 
ture  in  the  simulation  of  a  non-modal  world  is  encouraging:  the  algorithm 
should  not  impose  structure  on  a  world  that  contains  none.  A  brief  analysis 
of  the  non-modal  world  will  demonstrate  that  in  general  the  categorization 
algorithm  will  not  recover  structured  categorizations  when  no  natural  clus¬ 
ters  are  present. 

6.1.2  Theory  of  categorization  in  a  non-modal  world 

Our  goal  is  to  explain  the  behavior  of  the  categorization  algorithm  observed 
in  the  simulation  of  a  non-modal  world.  To  begin,  let  us  consider  the  best 
categorizations  one  could  construct  in  such  a  world.  For  example,  what 
would  be  the  best  set  of  two  categories  one  could  form  in  a  non-modal  world 
described  by  binary  features?  The  most  structure  one  could  impose  would 
be  to  sort  the  population  according  to  one  feature.  Because  all  the  features 
are  independent,  members  of  a  category  would  not  have  any  other  proper¬ 
ties  in  common.  Note  that  if  there  are  m  features,  then  there  are  m  possible 
categorizations  keeping  one  feature  constant.  Likewise,  the  best  set  of  four 
categories  would  be  one  in  which  two  features  are  held  constant  within  each 
category; there are C(m, 2) = m(m - 1)/2 such possible categorizations. We refer to the number
of  features  held  constant  in  a  categorization  of  a  non-modal  world  as  the  level 
of  the  categorization.  Figure  6.3  displays  the  different  levels  of  categoriza¬ 
tions  for  a  non-modal  world  containing  only  8  objects.  Within  each  category 
is  a  property  vector  describing  the  members;  an  “x”  indicates  either  a  1  or 
a  0.  To  indicate  that  there  are  several  possible  categorizations  at  each  level, 
the  feature  held  constant  at  level  1  is  not  held  constant  at  level  2. 
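
The notion of categorization level can be made concrete with a short sketch.
The helper names below are hypothetical, and the 8-object world is assumed to
contain each combination of three binary features exactly once.

```python
from itertools import combinations, product

def categorizations_at_level(n_features, level):
    """All ways of choosing which `level` features are held constant."""
    return list(combinations(range(n_features), level))

def partition(objects, held_constant):
    """Group objects that agree on every held-constant feature."""
    categories = {}
    for obj in objects:
        key = tuple(obj[i] for i in held_constant)
        categories.setdefault(key, []).append(obj)
    return list(categories.values())

# Assumed 8-object world: every combination of 3 binary features occurs once.
world = list(product((0, 1), repeat=3))
level_1 = partition(world, categorizations_at_level(3, 1)[0])
level_2 = partition(world, categorizations_at_level(3, 2)[0])
```

Here `level_1` holds two categories of four objects and `level_2` holds four
categories of two objects; holding no features constant gives the coarsest
categorization, and holding all three constant gives the finest.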

To understand the behavior of the categorization algorithm in such a
non-modal world, let us consider how the two components of the evaluation
function U (the property uncertainty Up and the category uncertainty Uc) vary
as we change categorization level. Let us expand our non-modal world to
contain 128 objects described by 7 independent, uniformly distributed, binary
features; we evaluate Up and Uc for each possible categorization level 0
through 7. Panel (a) of Figure 6.4 displays the results of the evaluation; Uc
has been normalized by η to be of the same scale as Up. Panels (b-d) display
the value of U for λ equal to .3, .5, and .7. Notice that as level increases
(as more features are held constant in each category) the value of Up
decreases linearly while Uc increases, also linearly. Because both curves are
linear and because they are normalized to be of the same scale, the change in
Up is equal to the negative of the change in Uc. As illustrated in panels
b-d, the net change in U depends upon the value of λ. If λ is greater than
.5, total uncertainty increases with level; for λ less than .5, total
uncertainty decreases. The value of U is constant when λ equals .5.

Figure 6.4: (a) The evaluation of Up and (normalized) Uc for the different
levels of categorization of a non-modal world consisting of 128 objects. The
level is equal to the number of features held constant, which is also the
logarithm of the number of categories. (b - d) The value of
U = (1 - λ)Up + ηλUc for λ of .3, .5, and .7.

The linearity of the graphs of Up and Uc is predicted by analysis; these
graphs were actually generated by simulating an independent world and
evaluating different categorization levels. Each increase in categorization
level is formed by keeping one more feature constant. Thus the property
uncertainty is decreased by an amount reflecting the uncertainty of that
feature. Because all the features are binary and uniformly distributed, the
decrease is log 2 = 1. The initial value of Up, at level 0, is 7 log 2 = 7;
the final value is zero. Similarly, with each increase in categorization
level, the number of categories is doubled, and a greater number of properties
is required to determine the category to which an object belongs. It can be
shown that the increase in category uncertainty for each increase in
categorization level is log √2 = .5. Thus, the value of η, the normalization
factor between the two uncertainties, is 2. Let L be the categorization
level, and let LMax be the maximum possible level; in our simulation with 128
objects LMax = 7. Combining the above results for the two uncertainties
yields the following equation for the total uncertainty U:

U = (1 - λ)(1.0)(LMax - L) + (2.0)λ(.5)L = (2λ - 1)L + LMax(1 - λ)    (6.2)

That is, U is linear in L, and the slope is determined solely by λ. When
λ > .5 the slope is positive; when λ < .5, the slope is negative.
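
The linear dependence of U on the level can be checked with a few lines of
code. This is a minimal sketch that hard-codes the per-level changes stated
above (1 for Up, .5 for Uc, with a normalization of 2); the function names and
default arguments are assumptions made for the example.

```python
def total_uncertainty(level, lam, max_level=7, eta=2.0):
    """U = (1 - lam) * Up + eta * lam * Uc for an ideal non-modal world."""
    up = 1.0 * (max_level - level)  # each held-constant feature removes log2(2) = 1
    uc = 0.5 * level                # each level adds log2(sqrt(2)) = 0.5
    return (1 - lam) * up + eta * lam * uc

def best_level(lam, max_level=7):
    """Level with the lowest total uncertainty (ties go to the lowest level)."""
    levels = range(max_level + 1)
    return min(levels, key=lambda level: total_uncertainty(level, lam, max_level))
```

With these definitions `best_level(0.3)` selects the finest level and
`best_level(0.7)` the coarsest; at a value of .5 every level gives the same
total uncertainty, so the choice made by `min` is arbitrary.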

The implication of the results in Figure 6.4 is that for an independent
world the best categorization is either the coarsest partition, where all
objects are in one category, or the finest, where each object is its own
category. Which is preferable depends on the value of λ; for a value of .5
the decision is arbitrary.3 Thus, we have explained the behavior of the
categorization algorithm displayed in the previous section. We have also
demonstrated the competence of the incremental categorization algorithm in
the non-modal world: the algorithm generates the correct solution.

3 The simulation generated a critical λ greater than .5 because of an
implementation modification to the categorization algorithm. Specifically,
merging occurs only if a merged categorization is sufficiently better than
the split categorization, where “sufficiently” is determined by a threshold.
Using a non-zero threshold imparts some hysteresis to the system and
overcomes instability caused by numerical inaccuracies.

The importance of the above result is that it allows the observer to detect
an attempt to categorize a non-modal world. Because such a world contains no
structure, there are no natural clusters, and no attempt at recovering modal
categories should be made. In the next section we will make use of this
diagnostic capability.

6.2 Multiple Modal Levels

6.2.1 Categorization stability

An ideal modal world is one in which all the properties are predictive of the
natural classes and all the classes are predictive of the properties. The
total uncertainty measure U has been constructed such that categorizations
exhibiting a high degree of modal structure will produce low values of
uncertainty. Because the incremental categorization algorithm developed in
chapter 5 makes use of U as a categorization evaluation function, the
algorithm recovers categories corresponding to modal clusters. To overcome
the combinatoric problems in generating possible partitions of a population,
the algorithm is stochastic, and thus is not guaranteed to find the correct
modal solution.

Table 6.1: Leaf specifications for four species of leaves. A leaf generator
creates property vectors consistent with the different specifications.

However, the more modal a population (the closer the classes in the
population approximate modal classes) the more reliable and repeatable is the
categorization process; the modal categories will be recovered more often.
This increase in reliability occurs because approximate categorizations,
those that nearly separate the natural classes, are accepted during the
incremental search for the best solution. Also, given an approximately modal
world, a relatively wide range of λ will result, with high probability, in
the incremental categorization algorithm discovering the modal categories. We
use the term categorization stability to refer to the degree to which the
recovery of a set of categories is repeatable and insensitive to changes in λ.
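
One way to make this notion operational is to run a categorizer repeatedly
over a range of the free parameter and measure how often the same partition
comes back. The sketch below is an assumption-laden illustration, not the
procedure used in this work: `stability`, `canonical`, the categorizer
signature `(objects, lam, seed)`, the object names, and the toy categorizer
with its fixed thresholds are all hypothetical.

```python
from collections import Counter

def canonical(partition):
    """Order-independent form of a partition, so that two runs can be compared."""
    return frozenset(frozenset(category) for category in partition)

def stability(categorize, objects, lambdas, runs=5):
    """Fraction of all runs that recover the most frequently recovered partition."""
    counts = Counter()
    for lam in lambdas:
        for seed in range(runs):
            counts[canonical(categorize(objects, lam, seed))] += 1
    return counts.most_common(1)[0][1] / sum(counts.values())

def toy_categorize(objects, lam, seed):
    """Hypothetical stand-in for the stochastic algorithm (it ignores the seed)."""
    if lam >= 0.75:
        return [list(objects)]              # coarsest categorization
    if lam <= 0.5:
        return [[obj] for obj in objects]   # finest categorization
    groups = {}
    for obj in objects:
        groups.setdefault(obj.split("-")[0], []).append(obj)
    return list(groups.values())            # one category per name prefix

objects = ["sp1-1", "sp1-2", "sp2-1", "sp2-2"]
```

A score near 1 over a wide range of values indicates a stable categorization;
a low score indicates that different runs or different parameter settings
recover different partitions.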

As an example of a population exhibiting categorization stability, we
consider the categorization of leaves. For this example, the property
specifications for four species of leaves are generated according to
descriptions provided by Preston [1976] (Table 6.1). The properties chosen
are known to be diagnostic of leaf species and thus are sufficient for the
categorization task. (For an explanation of the details of object generation
using property specifications see chapter 5.)

When categorizing a population of objects generated according to these
specifications, the correct categories are reliably recovered for
.5 < λ < .75.4 Figure 6.5 displays the results for two such executions of
the algorithm; for these examples λ was set to .6 and .55. In both cases the
correct categories, those corresponding to the species, are recovered.

4 An implementation detail: a λ of .75 is successful only when recovered
categories are internally re-categorized, as discussed. The
re-categorization, however, does not involve the re-computation of the
normalization coefficient η. Doing so would have the effect of reducing λ
because η decreases as a population is reduced, and its value directly
multiplies the Uc term in U.

Figure 6.5: Correctly categorizing four species of leaves. For these
executions of the categorization algorithm the value of λ was set to .6 (a)
and .55 (b). The identical results indicate the categorization is stable.

Outside this stable range, however, the correct categories are not
recovered. Values of λ near one cause the uncertainty measure to prefer
coarse categories, with high property uncertainty but low categorization
uncertainty. In the case of the leaves, values of λ greater than .75 make it
unlikely that the categorization algorithm will discover a partition into two
categories that has a lower uncertainty than having the entire population in
only one category. Thus the recovered categorization contains only one
category. (For a more complete analysis of the competence of the
categorization algorithm, see Section 5.3.) Conversely, for λ below .5, the
algorithm produces categories that are continually split, in the limit
yielding the finest possible categorization. As shown in the previous
section, this behavior is indicative of no useful internal structure; within
each species, the property variations are independent.5 This result is
“correct” because the method that generates objects from property
specifications does so by generating each property independently, and there
is no structure present below the species level. This world of leaves has
only one modal level of structure.

To summarize the leaves example, we have one region of λ that produces a
stable categorization. Also, values of λ outside this range produce no useful
category structure, and the behavior of the categorization algorithm mimics
the case when the world is independent. Recall that the setting of λ is
established by the observer according to his goals. The value must be
selected such that the categories provide a balance between the property
uncertainty and categorization uncertainty which satisfies the inference
requirements of the observer. If the λ of the observer lies within the stable
range for the leaves world, then the correct categorization for him to
recover is exactly the four species. If, however, the observer’s particular
value of λ lies outside this range then there exists no natural clustering of
the objects that adequately supports his goals.

6.2.2  Multiple  mode  examples 

We began this chapter by noting that the Principle of Natural Modes does not
imply the existence of a unique clustering of objects. Rather, clusters which
occur at different levels mirror the different levels of processes acting
upon the objects. In this section we will explore the behavior of the
categorization algorithm in the case of multiple levels of modal processes.
First, we examine the results of attempting to categorize a real domain in
which two levels of processes are apparent; these results will suggest that
the properties constrained by the higher level process prevent the discovery
of the lower level process. A simulation of an idealized two-process world
will produce similar behavior in the categorization algorithm, confirming the
masking of the lower level modes by the higher level properties.

5 The behavior that is expected is the continual splitting of categories. The
fact that it occurred around .5 is not significant and is purely a function
of the data. The critical value of .5 derived in the previous section only
applies when the entire population is independent.

Table 6.2: The property specifications for six different species of
bacteria. Because most of the real features have only one value per species,
two noise features are added (nf1 and nf2).

The  domain  of  infectious  bacteria  will  serve  to  illustrate  the  behavior 
of  the  categorization  algorithm  in  a  two-process  world.  Property  specifi¬ 
cations  for  six  different  species  of  bacteria  are  encoded  according  to  data 
taken  from  [Dowell  and  Allen,  1981];  Table  6.2  displays  the  specifications  for 
these  species.  Because  most  of  the  real  features  take  on  only  one  value  per 
species  (unlike  the  leaves  where  features  like  “length”  and  “width”  varied 
greatly) two noise features are added (nf1 and nf2). These features prevent
all  objects  of  the  same  class  from  having  identical  property  descriptions. 

Of the six species, three are from the genus bacteroides; these are
abbreviated as BF, BT, and BV. The other three (FM, FN, and FV) are from the
genus fusobacterium. Notice that several of the features of the
specifications are determined by the genus, while others are determined by
the species. For example, all members of bacteroides have the property
“gr-kan = R” (coding for “growth in presence of Kanamycin is resistant”).
Other properties, such as “dole,” vary between the species, ignoring genus
boundaries. Still others, such as “gr-rif,” are confounded between levels.
These property specifications are used to generate 12 instances of each type
of bacteria.

Figure 6.6 displays the results of categorizing this population with values
of λ of .65 and .7. Notice that in both cases the bacteria have been

Figure 6.6: Categorizing bacteria. In these examples λ equals (a) .65 and
(b) .7. The categories recovered correspond to the different genera.


Figure 6.7: Unstable categorization of bacteria. In these examples λ equals
0.5. The categories recovered do not correspond to either the genera or the
species. For example, in (a) the leftmost category corresponds to all
instances of one species, but the other three categories are mixtures of
species from one genus. This categorization is unstable; repeating the
categorization procedure (b) yields a different composite.

Table 6.3: The property specifications for a simulated two-process world.
Five features are modal for the genera (gf1 - gf5), two are modal for the
species (sf1 and sf2), and two are noise (nf1 and nf2).


To test the validity of this diagnostic inference, we simulate an ideal
“two-process modal” world. In such a world, some of the features used to
describe the objects are constrained by one modal “process,” while others are
constrained by a second modal process. Also, some noise features are
included. Table 6.3 lists the property specifications for such a world. By
“two-process modal” we mean that each feature is modal for the level at which
it operates. For example, gf1 (for “genus-feature-1”) is modal for the
different genera; it takes on a different value for each of the two genera.
Likewise, sf1 is modal for each species within the genus. For this particular
example there are five features constrained by the genus and two by the
species;7 two noise features are also included. The importance of this
simulation is that we know no features confound the two processes. In the
bacteria example certain features (such as “gr-rif”) appeared to be affected
by both genus and species. If the categorization algorithm produces unstable
categorizations in this simulation we know that such confounding properties
are not required to produce this behavior and that unstable categorizations
are an indicator of competing modal levels.

Figure 6.8: The categorization of the simulated two-process modal world
yields unstable categories when λ equals .5. The genus of each object is
indicated by the first digit in the name; the species, by the second.
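
A population for such a two-process world might be generated as sketched
below. The feature names follow Table 6.3, but the actual feature values are
not reproduced here, so the encoding is an assumption: genus features take the
genus index, species features take the species index within the genus (and are
therefore reused across genera), and noise features are random bits. The
function name and seed are likewise hypothetical.

```python
import random

def make_two_process_population(n_genera=2, n_species=3, per_species=10, seed=0):
    """Return {name: feature dict} for an idealized two-process modal world."""
    rng = random.Random(seed)  # seeded only so that the sketch is repeatable
    population = {}
    for genus in range(1, n_genera + 1):
        for species in range(1, n_species + 1):
            for k in range(per_species):
                # Five features fixed by the genus (assumed encoding).
                vector = {f"gf{i}": genus for i in range(1, 6)}
                # Two features fixed by the species within the genus.
                vector.update({f"sf{i}": species for i in (1, 2)})
                # Two noise features, independent of both levels.
                vector.update({f"nf{i}": rng.randint(0, 1) for i in (1, 2)})
                population[f"GS{genus}{species}-{k}"] = vector
    return population

population = make_two_process_population()
```

The object names follow the convention of Figure 6.8: the first digit gives
the genus and the second the species.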


A population of 60 objects, 10 per species, is generated according to the
specifications of Table 6.3. The population is then categorized for a wide
range of λ. For λ greater than .8, the categorization algorithm produces only
single-category categorizations. For .7 < λ < .8 a categorization
corresponding to the simulated genera is recovered. This categorization was
highly stable in that it was recovered on almost all categorization attempts
for λ within this range. However, for .5 < λ < .6 unstable categorizations
are formed; the categories recovered are approximately the union of some of
the different species. Figure 6.8 displays the result of two executions of
the categorization algorithm with λ equal to .5. The algorithm was
interrupted after each of the objects had been viewed once. In Figure 6.8a,
the four resulting categories closely correspond to those that would be
formed by combining the GS21 species with the GS23 species (both are of the
same genus) as well as combining GS11 with GS13.8 In the other categorization
attempt (Figure 6.8b) a different set of categories is generated. Finally,
for λ < .4 the algorithm produces the finest possible categorization,
yielding no structured categories (not shown).

The  results  of  this  simulation  support  the  conclusion  that  multiple  modal 
levels  yield  unstable  categorization  behavior.  Unfortunately,  we  cannot  in¬ 
voke  the  power  of  evolution  to  prevent  this  situation  from  arising.  The 
observer  must  be  able  to  recover  the  natural  categories  corresponding  to 
a  sufficiently  structured  process  level  to  support  necessary  inferences.  For 
example,  if  the  differences  in  the  species  of  bacteria  are  important  to  the 
observer  then  he  will  need  to  encode  properties  that  discriminate  between 
genera  as  well  as  properties  that  can  distinguish  the  species  within  the  genera. 
Thus,  the  observer  requires  a  method  of  recovering  the  natural  categories  at 
each  process  level  in  the  hierarchy  until  the  appropriate  level  is  achieved. 
In  the  next  sections  we  will  describe  a  possible  method  for  recovering  the 
natural  categories  at  each  level,  and  for  determining  the  processes  associated 
with  each  property  encoded  by  the  observer. 

8 If the algorithm were permitted to continue, subsequent viewing of the
objects would help correct prior mistakes.


6.3  Process  separation 

6.3.1  Recursive  categorization 

Let us continue the simulation example of the previous section. If the goals
of the observer require that he recover more than just the genera, then he
needs a method by which to separate the species. We know that reducing the
value of λ below the stable value that recovered the genera will not cause
the categorization algorithm to reliably recover the species; the interaction
between the species and genera causes composite categories to be formed.

Suppose,  however,  we  consider  each  genus  separately,  as  its  own  world  of 
objects.  In  that  case,  the  world  contains  only  three  classes  of  objects,  namely 
the  three  species.  Unlike  the  entire  population,  which  contained  more  than 
one  modal  level,  there  is  only  one  level  of  structure  present  in  this  world.  If 
the  observer  can  recover  the  modal  structure  within  this  population,  then, 
by  applying  the  procedure  to  both  genera,  he  would  be  able  to  recover  the 
categories  corresponding  to  the  species. 
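
The recursive strategy can be sketched as follows. This is an illustration
under stated assumptions rather than the implementation used in this work:
`categorize` is any function returning a list of categories, a result of one
category or of all singletons is treated as the non-modal signature described
in Section 6.1, and `toy_categorize` is a hypothetical stand-in that simply
reads the genus and species digits from names such as GS11-0.

```python
def recursive_categorize(objects, categorize):
    """Return nested lists of categories, one nesting level per modal level."""
    categories = categorize(objects)
    # One category, or every object alone, signals no further structure.
    unstructured = len(categories) <= 1 or all(len(c) == 1 for c in categories)
    if unstructured:
        return sorted(objects)
    return [recursive_categorize(c, categorize) for c in categories]

def toy_categorize(names):
    """Hypothetical categorizer: split on the first name digit that varies."""
    for position in (2, 3):  # genus digit first, then species digit
        groups = {}
        for name in names:
            groups.setdefault(name[position], []).append(name)
        if len(groups) > 1:
            return [groups[key] for key in sorted(groups)]
    return [list(names)]

population = [f"GS{g}{s}-{k}" for g in (1, 2) for s in (1, 2, 3) for k in range(2)]
tree = recursive_categorize(population, toy_categorize)
```

The first call separates the two genera; each genus is then treated as its own
world, within which the three species are separated.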

In this world there are three types of features: modal, where each feature
takes on a different value for each class; noise, where the value is
independent of the class; and constant, where the value never varies. The
only differences between this world and previous examples in which there was
only one modal level are the constant features. For example, the leaves
domain contained several highly diagnostic (almost modal) features (such as
“apex” and “base”) as well as mostly unconstrained features (such as
“length”). We
know  that  the  categorization  algorithm  can  successfully  categorize  such  a 
population.  However,  we  need  to  consider  the  effect  of  the  constant  features 
on  the  category  recovery  procedure.  In  particular,  how  does  the  presence  of 
the  constant  features  affect  the  components  of  the  categorization  evaluation 
function  U? 
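
The effect of constant features on Uc can be probed numerically for a small
world. The sketch below rests on an assumption about how category uncertainty
is computed: for each subset of features, the uncertainty of an object is
taken to be the logarithm of the number of classes consistent with its
description on that subset, and Uc is the average over all subsets including
the null subset. The function name and the encoding of classes are
hypothetical.

```python
from itertools import combinations
from math import log2

def average_category_uncertainty(c, m, x):
    """Average over all feature subsets of log2(number of matching classes)."""
    # Class k has m modal features equal to k and x constant features equal to 0.
    prototypes = [(k,) * m + (0,) * x for k in range(c)]
    target = prototypes[0]  # by symmetry, any class gives the same average
    n_features = m + x
    total, n_subsets = 0.0, 0
    for size in range(n_features + 1):  # size 0 is the null subset
        for subset in combinations(range(n_features), size):
            matching = sum(
                all(p[i] == target[i] for i in subset) for p in prototypes
            )
            total += log2(matching)
            n_subsets += 1
    return total / n_subsets
```

Under this assumption, calling the function with a fixed number of classes and
modal features while varying the number of constant features returns the same
value each time.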

It  can  be  shown  that  constant  features  have  no  effect  on  either  Up  or 
Uc -9  Therefore,  if  we  treat  the  genus  members  as  separate  population,  we 

9First,  consider  Up.  Constant  features  have  no  uncertainty:  1  •  log  1  =  0.  Thus,  Up  is 
unaffected  by  the  addition  of  constant  features.  Next  consider  Uc-  At  first  one  might 
expect  the  addition  of  constant  features  to  add  to  category  uncertainty:  constant  features 
are  structure  shared  by  all  objects,  leading  to  category  confusion  We  can  show  that  this 
is  noi  the  case. 

Let  us  assume  we  have  a  world  c  classes,  and  that  objects  are  described  by  in  modal 
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expect  that  the  categorization  algorithm  to  be  able  to  recover  the  species. 
We  test  this  conclusion  by  executing  the  categorization  algorithm  on  the 
members  of  the  first  genus  of  the  simulation  example  of  the  previous  section; 
Figure  6.9  displays  the  results  of  two  categorization  attempts,  with  A  set  to 
.65  in  both  cases.  Notice  that  the  correct  species  categories  are  recovered;  the 
isolated  errors  would  be  corrected  in  subsequent  viewing  of  the  incorrectly 
categorized  objects.  Repeated  application  of  the  categorization  algorithm 
shows  this  categorization  to  be  stable.  Thus,  the  observer  can  reliably  recover 
the  species  categories  once  the  genera  have  been  separated. 

We  can  test  this  procedure  in  the  domain  of  the  bacteria  as  well.  Figures 
6.10  and  6.11  display  the  results  of  re-categorizing  the  genera  bacttroides 
and  fusobacterium,  respectively.  For  the  bacteroides  the  value  of  A  is  .6;  for 
fusobacierium,  .5.  (In  a  moment,  we  will  discuss  the  effect  on  A  caused  by 
recursively  categorizing  the  genera.)  In  both  of  these  cases  the  categorization 
procedure  reliably  recovers  the  species.  In  the  previous  section,  we  demon¬ 
strated  the  ability  of  the  categorization  procedure  to  recover  the  genera. 
Thus,  using  a  recursive  categorization  strategy,  the  observer  could  recover 
both  the  genera  and  the  species. 

Conceptually,  there  is  a  problem  with  performing  recursive  categoriza- 

features (which take a different value for each of the $c$ classes) and by $x$ constant features.
Now, let us evaluate the category uncertainty $U_c$ for the categorization corresponding to
the modal classes, the “correct” categorization. To do so requires computing the category
uncertainty of each object for each possible feature subset description. We know that
there are $2^{m+x}$ possible feature subsets (for this analysis we must include the null set).
Because the $m$ features are modal, if any of those features is included in the description
of an object, then there is no category uncertainty. If, however, there are no modal features
in the description, then the object may match any category and the uncertainty is $\log c$.
The number of feature subsets containing no modal features is $2^x$. Thus the average
category uncertainty of each object (and thus the complete average) is:

$$U_c = \frac{2^x}{2^{m+x}} \log c = \frac{1}{2^m} \log c$$

That is, $U_c$ is independent of $x$. The intuition behind the result is that the same proportion
of feature subsets contain no modal information, regardless of the number of constant
features. When there are no constant features, then only the null subset produces category
uncertainty. As constant features are added, the number of possible subsets increases
by the same ratio as the number of non-modal subsets (namely $2^x$). This is true for any
categorization we evaluate; we used the correct modal categorization only to make an
analytic computation of the category uncertainty possible.
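
The counting argument above can be checked by brute-force enumeration. The following is a minimal sketch, not the thesis implementation: it assumes one prototype object per class, equally likely classes (so the uncertainty of an object matching $k$ categories is $\log_2 k$), and constant features that all take the value 0. The function name is hypothetical.

```python
from itertools import combinations
from math import log2

def average_category_uncertainty(c, m, x):
    """Average category uncertainty of one object over all feature subsets.

    c: number of classes, m: number of modal features,
    x: number of constant features (all assumptions of this sketch).
    """
    n = m + x
    # Class k: every modal feature takes the value k, every constant feature 0.
    prototypes = [tuple([k] * m + [0] * x) for k in range(c)]
    obj = prototypes[0]  # by symmetry, any object gives the same average
    total = 0.0
    count = 0
    for size in range(n + 1):  # size 0 is the null subset
        for subset in combinations(range(n), size):
            # Categories consistent with the object on the observed features.
            matches = sum(all(p[i] == obj[i] for i in subset) for p in prototypes)
            total += log2(matches)
            count += 1
    return total / count

without_constants = average_category_uncertainty(c=4, m=2, x=0)
with_constants = average_category_uncertainty(c=4, m=2, x=3)
```

Under these assumptions both calls give $\log_2 4 / 2^2$, so adding the three constant features leaves the average unchanged.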
Figure 6.10: Re-categorizing a population of the bacteroides bacteria. For these
executions, the value of $\lambda$ was 0.6. When the population is restricted to only the
genus members, the categorization algorithm reliably recovers the species.

Figure 6.11: Re-categorizing a population of the fusobacterium bacteria. For
these executions, the value of $\lambda$ was 0.5. When the population is restricted to only
the genus members, the categorization algorithm reliably recovers the species.


tion, and that difficulty is hidden within the computation of the normalization
factor $\eta$. By restricting the population to only one genus, the property
uncertainty of the coarsest categorization (the categorization consisting of
only one category) is greatly reduced, thereby reducing the value of $\eta$. As
shown in the equation for $U$:

$$U(Z) = (1 - \lambda)\, U_p(Z) + \lambda\, \eta(Z)\, U_c(Z) \qquad (6.3)$$

$\eta$ directly multiplies the $U_c$ term, and thus controls (along with $\lambda$) the relative
contribution of the two uncertainties.[10] Thus the effect of $\lambda$ on the
relative contribution of $U_p$ and $U_c$ to $U$ is modified when $\eta$ is recomputed.
This change in the effect of $\lambda$ complicates the interpretation of $\lambda$ as being
a trade-off between uncertainties established according to the goals of the
observer.
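
A small numerical sketch may make this interaction concrete. It evaluates equation 6.3 and the share of the total weight that falls on the $U_c$ term. The values of $\lambda$ and $\eta$ are hypothetical, chosen only for illustration, and $\eta$ is treated as a plain number; the function names are not from the thesis.

```python
def total_uncertainty(lam, eta, u_p, u_c):
    """Equation 6.3: U = (1 - lambda) * U_p + lambda * eta * U_c."""
    return (1.0 - lam) * u_p + lam * eta * u_c

def relative_weight_of_uc(lam, eta):
    """Fraction of the combined weight that is applied to the U_c term."""
    weight_p = 1.0 - lam
    weight_c = lam * eta
    return weight_c / (weight_p + weight_c)

# Hypothetical values: the same lambda, with eta recomputed on a sub-population.
full_population = relative_weight_of_uc(lam=0.6, eta=2.0)
single_genus = relative_weight_of_uc(lam=0.6, eta=1.0)
```

With these illustrative numbers the share of weight on $U_c$ falls from about 0.75 to about 0.6 although $\lambda$ is unchanged, which is the sense in which recomputing $\eta$ alters the meaning of a fixed $\lambda$.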

Perhaps, then, within the present context of categorizing objects we
should view $\lambda$ as a parameter controlled by the observer and used by him
to probe the category structure of the population. If he can discover category
structure that is stable over a range of $\lambda$, then he can assert that these
categories are more likely to correspond to natural processes. As yet, the
question of directly relating the goals of the observer to the categorization
process has not been satisfactorily resolved. The intuition that the goals of
the observer specify the trade-off between property uncertainty and category
uncertainty is strong; more work is needed to re-examine the form of
equation 6.3 to resolve this issue.

6.3.2  Primary  process  requirement 

As a final comment about process separation we note that we were able to
recover the species structure only after recovering the genera. For recursive
categorization to be effective, there must be a primary process that can be
recovered at each step. In the case of the bacteria, the genera represented

[10] Unlike decreasing $\lambda$, lowering the value of $\eta$ reduces the weight accorded $U_c$ without
increasing the weight of $U_p$. Normally, reducing $\lambda$ increases the weight of $U_p$, making the
categorization evaluation function more sensitive to property variation. This difference,
along with some implementation details about local category formation, is the reason
that simply lowering $\lambda$ will not accomplish the separation of the species, but
reducing $\eta$ will.
a strong modal structure that could be recovered immediately. In the simulation,
we provided more modal features for the genus categories than for
the species. This imbalance provided a primary process that could initialize
the recursive categorization procedure. Therefore, if the observer categorizes
objects in a multiple modal world by applying the categorization algorithm
recursively, then he must be provided with a representation that permits
him to discover a primary process at each level.
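
The control flow of the recursive strategy can be sketched as follows. This is only an illustration under stated assumptions: the `categorize` argument stands in for the categorization algorithm, and the toy stand-in used here (splitting on the first feature that varies) is hypothetical and is not the statistical procedure of the thesis. Recursion halts when no further split is found.

```python
from collections import defaultdict

def recursive_categorize(population, categorize):
    """Apply `categorize` to a population, then to each recovered category."""
    groups = categorize(population)
    if len(groups) <= 1:  # no primary process recovered at this level: halt
        return list(population)
    return [recursive_categorize(group, categorize) for group in groups]

def split_on_first_varying_feature(population):
    """Toy stand-in for the categorization algorithm (an assumption)."""
    for i in range(len(population[0])):
        groups = defaultdict(list)
        for obj in population:
            groups[obj[i]].append(obj)
        if len(groups) > 1:
            return list(groups.values())
    return [list(population)]

# Hypothetical objects described by (genus-level feature, species-level feature).
population = [("g1", "a"), ("g1", "b"), ("g2", "c"), ("g2", "c")]
tree = recursive_categorize(population, split_on_first_varying_feature)
```

In this toy population the first pass separates the two genus-level groups, and the second pass finds further structure only inside the first group.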

6.4  Evaluating  and  Assigning  Features 

The emphasis of this chapter, indeed this thesis, has been on the recovery
of natural object categories. We have demonstrated that the observer must
be provided with a representation that reflects the modal structure present
in the world. However, it would be desirable for the observer to be able to
improve his representation as he categorizes the objects in the world. By
“improve,” we mean make the representation more sensitive to the natural
categories present. In the context of property vectors this process would
include “growing” new features that are constrained by the natural modes, as
well as assigning computed features to the correct modal level. In this section
we provide a mechanism by which the observer can evaluate the effectiveness
of features in terms of how well they support the recovery of natural mode
categories.

6.4.1  Leaves  example 

Let  us  return  to  the  leaves  example  introduced  at  the  start  of  the  chapter. 
We  assume  that  the  categorization  algorithm  has  been  executed  and  the 
correct natural mode categories (the species) have been discovered. Furthermore,
we assume that the observer has determined that no other modal
levels  exist.  Now,  we  wish  to  develop  a  method  by  which  the  observer  can 
determine  which  features  are  most  sensitive  to  the  species  categories. 

We proceed by creating a short taxonomy of the leaves shown in Figure 6.12.
The only levels present are those which correspond to solutions to
the categorization algorithm. The middle level corresponds to the species,
and is the “natural” level of the taxonomy. This level has as its superordinate
the single category, which is the recovered solution for $\lambda$ near 1.0.
The level below the species consists of the finest possible categorization, the
solution for $\lambda$ near zero. We construct this taxonomy for the purpose of
evaluating the categorization uncertainty measure $U$ for a range of $\lambda$ and
for different subsets of the features. That is, for a given subset of features,
we will measure the range of $\lambda$ for which the species categorization is the
preferred level of the taxonomy. We refer to the extent of this range as the
$\lambda$-stability of the features for these natural categories. We will use the
$\lambda$-stability of the features to evaluate their utility in recovering the category
structure.
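
The definition of $\lambda$-stability can be sketched numerically. The sketch below assumes that each level of the taxonomy is summarized by a pair $(U_p, U_c)$, that $\eta$ is a fixed number, and that the preferred level is the one with the lowest total uncertainty under equation 6.3. The uncertainty values and the function name are hypothetical. Because each level's total uncertainty is linear in $\lambda$, the preferred range of a level is a single interval.

```python
def lambda_stability(levels, target, eta=1.0):
    """Width of the range of lambda in [0, 1] over which `target` has the
    lowest total uncertainty U = (1 - lambda) * U_p + lambda * eta * U_c."""
    u_p_t, u_c_t = levels[target]
    intercept_t, slope_t = u_p_t, eta * u_c_t - u_p_t
    low, high = 0.0, 1.0
    for name, (u_p, u_c) in levels.items():
        if name == target:
            continue
        intercept, slope = u_p, eta * u_c - u_p
        # target is preferred when
        # intercept_t + slope_t * lam <= intercept + slope * lam
        d_slope = slope_t - slope
        d_intercept = intercept - intercept_t
        if d_slope > 0:
            high = min(high, d_intercept / d_slope)
        elif d_slope < 0:
            low = max(low, d_intercept / d_slope)
        elif d_intercept < 0:
            return 0.0  # the other level is better for every lambda
    return max(0.0, high - low)

# Hypothetical (U_p, U_c) pairs for a three-level taxonomy.
levels = {
    "single category": (2.0, 0.0),
    "species": (0.5, 0.5),
    "finest": (0.0, 2.0),
}
stability = lambda_stability(levels, "species")
```

For these illustrative values the species level is preferred for $\lambda$ between 0.25 and 0.75, giving a stability of 0.5.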

Tables 6.4 and 6.5 display the $\lambda$-stability values for different subsets of
the features. (Not all subsets are displayed.) For example, panel (a) of the
first table reveals that if only the feature “apex” is used to describe the
leaves, then for a range of $\lambda$ of 0.87, the species level of the taxonomy is
preferred over the other two levels. Notice the inclusion of the feature nfl, a
noise feature. This feature, whose value is assigned randomly for each object,
provides a baseline against which to compare other features. Panel (a) shows
all 1-feature subsets, and the $\lambda$-stability value associated with each subset.
Notice that “apex,” “base,” and “color” have a relatively high stability,
indicating they are the best individual features. This does not mean that
each of these features is sufficient for the recovery of the species categories;
for example, poplar and elm both have a rounded base. Rather, given the
particular taxonomy of Figure 6.12 these features are highly selective of the
species level. Notice that “width” and “length” have low $\lambda$-stability values,
indicating little diagnosticity for the species. Finally, the noise feature has
no (significant) $\lambda$-stability.

Panel (b) displays some of the 2-feature subsets, including some pairs
formed by combining a good single feature with noise. First, notice that the
best pair is in fact the combination of the two best single features. This does
not have to be the case. For example, suppose the two best single features
provide redundant information, and that two other features are orthogonal
in their separation of the population. In that case, the combination of the
two orthogonal features would provide a greater separation of the classes
and thus a larger $\lambda$-stability value. The less the features in a domain display
this form of interaction, the easier it is to evaluate additional features. The
reason for this is that the combinatorics of $\binom{n}{k}$ (choosing $k$ of $n$ features)
normally preclude evaluating all possible subsets of features. If the features
combine independently, small subsets of features can be tested and then combined.
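
The cost of exhaustive evaluation, and one way small subsets might be combined, can be sketched as follows. Greedy forward selection is used here as a stand-in technique; it is an assumption of this sketch, not a procedure taken from the thesis. The scoring function and its values are toy placeholders rather than stability measurements, and the feature names are hypothetical.

```python
from math import comb

def count_subsets(n_features, k):
    """Number of feature subsets of size k that exhaustive evaluation visits."""
    return comb(n_features, k)

def greedy_subset(features, score, size):
    """Grow a subset one feature at a time, keeping the best-scoring addition."""
    chosen = []
    remaining = list(features)
    while remaining and len(chosen) < size:
        best = max(remaining, key=lambda f: score(chosen + [f]))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Toy scorer (an assumption, not a stability measurement): each feature
# contributes a fixed amount, so the features combine independently.
toy_value = {"f1": 0.4, "f2": 0.3, "f3": 0.1, "noise": -0.2}

def toy_score(subset):
    return sum(toy_value[f] for f in subset)

best_pair = greedy_subset(toy_value, toy_score, 2)
```

When features interact (for example, when the two best single features are redundant), a greedy search of this kind can miss the best subset, which is why independence matters for this shortcut.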
Table 6.4: Selected $\lambda$-stability measurements for different subsets of features in
the leaves example; subsets of length 1, 2, and 3 are shown. By comparing the
addition of a new feature with the addition of a noise feature (nfl) one can judge
the utility of the new feature.


Second, note that the subset (width, apex) has a much greater stability
value than (apex, nfl). This result demonstrates that although the “width”
feature by itself provides little support for the recovery of the natural categories,
it does not act as destructively as pure noise. The highly destructive
action of the noise feature can be further demonstrated in panel (c) of Table 6.4.
Compare the triplet (apex, base, nfl) with both the first triplet of
(apex, base, color) and with the top pair in panel (b) of (apex, base). The
extreme reduction in the stability caused by the noise is an indication to the
observer that the feature “nfl” has little use and should be removed from his
representation. By using this comparison-to-noise strategy, the observer can
also evaluate the addition of new features. Given a proposed new feature,[11]
this mechanism provides a means to evaluate its utility.

The panels of Table 6.5 show that the maximum value of $\lambda$-stability remains
quite high (about 0.9) until the inclusion of the last 3 features “width,”
“length,” and “nfl.” Including either “width” or “length” reduces the
$\lambda$-stability to 0.78, and the inclusion of the noise feature reduces the stability to
a value of 0.71. Thus, most of the leaf features are relatively uniform in their
sensitivity to the species structure of the leaves.

6.4.2  Bacteria  example 

The fact that the different features of the leaves do not exhibit large differences
in their diagnosticity for the species is not surprising; these features
were chosen from a leaf identification reference [Preston, 1976]. There is only
one  modal  level  and  each  of  the  features  was  chosen  to  be  useful  in  identifying 
that  level.  For  the  bacteria  example,  however,  there  are  two  modal  levels. 
Some  features  are  sensitive  to  the  genera,  others  to  the  species.  Therefore 
let  us  consider  evaluating  the  features  of  the  bacteria  domain. 

The taxonomy we construct resembles that for the leaves example, but
it includes both natural levels (Figure 6.13). In this case we will evaluate
the $\lambda$-stability for each natural clustering. Panel (a) of Table 6.6 displays
the $\lambda$-stability values for several feature subsets for the genus level of the

[11] An important, and open, question is how the observer proposes new features. The
literature is quite sparse in addressing this question, with most attempts being confined
to arithmetic combinations of previous features (for example see Boyle [19??]). Recently,
Michalski and his colleagues have explored the issue of the logical manipulation of features
[Michalski, 19??].
Figure 6.13: A taxonomy of the bacteria of Table 6.2 separated according to
genus and species.

Table 6.6: Evaluation of features for the separation of the (a) bacteria genera
and of the species (b) bacteroides and (c) fusobacterium.
bacteria taxonomy; included are the best feature subsets of length 3, 4, and
5. Notice that the best subset of length 3 (location, gr-kan, glc) has a
$\lambda$-stability value of 1.0; this maximum value occurs because these features are
modal for the different genera (see Table 6.2). The best subset of length 4
adds the feature “gr-pen,” but this lowers the stability value to 0.86. Notice
that the second-best subset of length 4 reduces that value to 0.71. Finally, the
best subset of length 5, generated by including “gr-rif,” is only marginally
better than a subset generated by adding a noise feature to the best length 4
subset (0.72 as opposed to 0.66). These results demonstrate that the features
(location, gr-kan, glc, gr-pen) are the features most constrained by the processes
associated with the genus of the bacteria, and that the other features
provide little useful genus information.

Next, we evaluate the separation of species within the genus. For the
genus bacteroides, the features “gr-rif” and “bile” are constant, providing no
information about the species. Thus the only remaining features are “rham,”
“esculin,” and “dole.” Since each of these takes on one value for one of the
species and another value for the other two, and because they each single out
a different one of the three species, these three features behave identically
with respect to $\lambda$-stability. This behavior is indicated in panel (b) of Table
6.6. For the fusobacterium genus, however, the features do have a differential
effect. As shown in panel (c), the $\lambda$-stability remains quite high (about 0.9)
for the best feature subsets of length 4 or less. However, the best subset
of length 5 requires the addition of feature “gr-pen” and the $\lambda$-stability is
greatly reduced (0.76). As “gr-pen” was one of the features discovered to be
important for the separation of the genera, we know that this feature crosses
the modal levels, and thus is a weak feature for the species clustering.

We can summarize the results of the bacteria example by displaying an
annotated taxonomy of the domain (Figure 6.14). The features tied to the
branches represent the properties of the objects constrained by the natural
processes responsible for that particular natural division. In essence, the
observer has learned not only to identify the natural categories, but also to
relate the properties of objects to the natural processes in the world.
7.1 Summary

We  began  this  thesis  with  the  following  three  questions: 

• What are the necessary conditions that must be true of the world if
a  set  of  categories  is  to  be  useful  to  the  observer  in  predicting  the 
important  properties  of  objects? 

•  What  are  the  characteristics  of  such  a  set  of  categories? 

• How does the observer acquire the categories that support the inferences
required?

Let  us  consider  each  in  turn. 

The first question is about the world. What needs to be special about the
world if the observer is to be able to make inferences about the important
properties of objects? As an answer, we proposed the Principle of Natural
Modes: the interaction between the processes that create objects and the
environment that acts upon them causes objects to cluster in the space of
properties important to their interaction with the environment. The importance
of this claim is that without such a constraint, many of the perceptual
inferences that are necessary to the survival of the observer cannot be made.
This statement is true even at the lowest levels of perception. For example,
consider the method by which a tick finds a host. The tick climbs onto a
branch or blade of tall grass, waits until it detects the presence of butyric acid
(a chemical produced by warm-blooded animals), then releases the branch
(or jumps) and falls towards the ground. If no host is underneath, the tick
starts again. Now let us consider the tick’s strategy in terms of natural
modes. All mammals have many biological processes in common that are
unique to mammals; as such, mammals form a natural mode. The tick’s
strategy is an effective one because the presence of butyric acid is a strong
indicator of the proximity of a mammal. Although one can view this inference
simply as a high correlation statistic, the underlying reason why the
strategy of the tick is successful is that butyric acid is a good predictor
of an object belonging to the natural mode of mammal.

Although  the  inferences  that  must  be  made  by  a  human  observer  may  be 
more  varied  and  more  complex  than  those  of  a  tick,  the  principles  underlying 
the  predictions  of  unobserved  properties  are  no  different.  Given  an  apple, 
we  know  we  can  eat  it.  Given  a  tiger,  we  know  to  run.  The  necessary 
requirement  for  being  able  to  make  these  inferences  is  that  we  must  be  able 
to  determine  the  natural  mode  to  which  an  object  belongs.  The  categories 
we  use  to  describe  these  objects  must  be  consistent  with  the  natural  mode 
structure  of  the  world. 

The existence of natural modes allows us to define the problem of categorization,
namely the recovery of object categories corresponding to the
natural modes important to the observer. Our solution to this problem required
decomposing the task into two components. First, the observer must be able
to identify when a set of categories corresponds to natural classes, and second,
he must be able to recover such a set of categories from the available
data. These two components provide the answers to the second and third
questions this thesis sought to address.

We  constructed  a  measure  of  how  well  a  set  of  categories  reflected  the 
natural  modes  by  measuring  how  well  the  categories  supported  the  inference 
requirements  of  the  observer.  We  argued  that  if  a  set  of  categories  satisfied 
the  goals  of  the  observer  and  permitted  him  to  make  the  necessary  inferences 
about  the  properties  of  objects,  then  that  set  of  categories  must  capture  the 
structure  of  the  natural  modes. 

In our analysis of the goals of the observer and of the characteristics of
a set of categories that support those goals, we identified two conflicting
constraints. First, the observer requires that knowledge of the category of
an object be sufficient to make strong inferences about the properties and
behavior of that object. This requirement favors the formation of fine,
homogeneous categories. Such categories are highly structured and thus convey
much information about the properties of their members. Larger categories
have less constrained properties and thus the observer has a greater property
uncertainty once the category of an object is known. Second, the inferences
made by the observer must be reliable; thus, he requires that the assignment
of an object to a category be a robust process. Such a constraint favors the
formation of coarse categories, where few properties are needed to determine
category membership. The coarser a set of categories, the easier it is to
determine category membership of an object; there is less category uncertainty
for a given object.

Therefore, the observer is faced with a trade-off between the ease of making
an inference about an object and the specificity of the inference. To make
this trade-off explicit, we derived a measure for each of the two uncertainties
(based on information theory) and combined them using a free parameter
$\lambda$ as a relative weight. This combined measure, referred to as the total
uncertainty of a categorization, allowed the observer to explicitly control
the balance of uncertainty. If the observer requires precise inferences, a low
value of $\lambda$ selects tightly constrained categories; these categories provide the
necessary inferential power, but at the expense of requiring detailed information
about an object to determine the category to which it belongs. Likewise,
if the observer requires a robust categorization procedure even when little
sensory information is provided, a high value of $\lambda$ causes coarse categories
to be preferred; they are easily accessed with little sensory information, but
they permit the observer to make only weak inferences about the properties
of objects.

The  measure  we  derived  is  based  solely  on  the  goals  of  the  observer;  sets 
of  categories  which  support  the  goals  of  inference  of  the  observer  yield  lower 
total  uncertainty  than  those  that  do  not.  But  how  do  we  relate  this  measure 
to  the  natural  modes?  As  argued  above,  we  know  that  the  goals  of  inference 
of  the  observer  can  only  be  accomplished  if  he  recovers  the  natural  categories. 
It  is  the  structure  of  the  modes  that  permits  the  inference  of  unobserved 
properties from observed properties. Thus, by directly measuring how well
the observer accomplishes his goal, we are measuring how successful he has
been in recovering the natural categories.

Having constructed a measure capable of evaluating the degree to which
a set of categories captured the natural mode structure, we next considered
the problem of recovering the natural categories from the data provided
by the environment. Based upon the learning paradigm of formal learning
theory, we defined a categorization paradigm that allowed us to identify the
components of the categorization process. This paradigm was developed
to be consistent with our intuitions about the categorization procedure; for
example, objects are viewed sequentially, with the observer modifying his
current categorization of objects as each new data item is viewed.

There are three critical components in the paradigm. First is the representation,
the information encoded by the observer upon viewing an object.
If the representation does not have the power to distinguish between objects
in different natural modes (if the representation is not class preserving)
then the observer cannot hope to recover the natural categories. Second
is the hypothesis evaluation function, which provides the criteria by which
the observer chooses a particular categorization. The last component of the
paradigm consists of a hypothesis generation method. This component,
which is responsible for producing categorization hypotheses, is also critical
to the success of the categorization procedure. Because of the combinatorics
of the partitioning problem, one cannot attempt all possible categorizations in
a world of many objects. Therefore, one needs to develop a procedure that
will eventually converge to the correct set of categories.
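
To give a sense of scale for the partitioning problem: the number of ways to partition $n$ distinct objects into non-empty categories is the Bell number $B_n$. The short sketch below computes it with the Bell triangle; the function name is ours and the code is only an illustration of the growth rate, not part of the categorization procedure.

```python
def bell_number(n):
    """Number of ways to partition n distinct objects into categories."""
    row = [1]  # row 0 of the Bell triangle
    for _ in range(n):
        new_row = [row[-1]]  # each row starts with the last entry of the previous row
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]

partitions_of_5 = bell_number(5)
partitions_of_10 = bell_number(10)
```

Five objects already admit 52 categorizations and ten objects admit 115975, so exhaustive search is impractical for populations of even modest size.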

Using the paradigm as a model we constructed a categorization procedure.
This procedure implements the total uncertainty of a categorization as
the hypothesis evaluation function. The hypothesis generation method we
present is a dynamic, data-driven procedure. Upon viewing a new object,
the observer produces a new set of categories by modifying the previous
hypothesis. Although the algorithm is statistical in nature, and not guaranteed
to produce the correct categorization, we have demonstrated its effectiveness
in several domains. One of these domains consists of the soybean data of
Michalski and Stepp [1983], which have been shown to be challenging for
standard clustering techniques. The algorithm successfully recovered the
four species of diseases present and did not require a priori knowledge of
the number of classes contained in the population.

Finally, we considered the case of a multiple mode domain, a domain
in which there is more than one level of natural structures. The example
we used was that of infectious bacteria, where there is structure at both
the genus and species level. We first demonstrated a technique by which
the observer could recover the different levels present. This technique relies
on the observer being able to detect a primary process level; once the
observer discovers these categories he can then recursively categorize each
sub-population in search of additional structure. To support such a procedure
we analyzed the case of attempting to categorize a world in which
there is no modal structure. By determining the pathological behavior observed
in such a situation, we provide the observer with the necessary halting
conditions for the recursive strategy.

An  important  aspect  of  the  multiple  mode  analysis  was  the  development 
of  a  method  for  evaluating  the  utility  of  a  feature  for  recovering  the  natural 
categories.  In  the  single  mode  world,  this  technique  provides  the  observer 
with  the  means  for  evaluating  new  features,  and  thus  permits  him  to  learn 
a  better  representation.  In  the  multiple  modal  world  this  technique  also 
provides  a  mechanism  for  assigning  the  different  features  encoded  by  the 
observer to the different process levels present in the domain. This technique
begins to address the fundamental problem of recovering natural processes
as opposed to recovering only the categories formed by the processes.

7.2  Clustering  by  Natural  Modes 

One  of  the  contributions  of  this  work  is  a  new  method  by  which  to  measure 
the  quality  of  a  set  of  categories.  The  measure  U  —  the  total  uncertainty 
of  a  categorization  —  reflects  how  well  the  categories  support  the  goals  of 
making  inferences  about  the  properties  of  objects.  How  does  this  method 
compare  to  other  clustering  techniques? 

First,  we  again  mention  that  the  categorization  procedure  based  upon 
the  uncertainty  measure  was  capable  of  successfully  categorizing  the  soybean 
data of Michalski and Stepp [1983]. In their work, they report experiments in
which  they  attempted  to  categorize  those  data  using  18  different  numerical 
clustering  techniques.  Of  these,  only  4  were  successful.  Thus,  for  at  least 
this  set  of  data  the  performance  of  the  categorization  technique  is  at  least 
comparable  to  other  clustering  algorithms.  Because  the  uncertainty  measure 
has  the  desirable  property  of  being  insensitive  to  unconstrained  features,  it 
provides  a  robust  method  of  recovering  categories  in  a  domain  in  which 
irrelevant  features  contaminate  the  data.  Furthermore,  we  have  provided  a 
technique  by  which  the  relevance  of  a  feature  can  be  assessed  once  the  correct 
categories  are  known. 

But  more  important  than  the  performance  of  the  algorithm  is  the  basic 
design of the categorization evaluation function. This function explicitly
measures how well a particular set of categories supports the goals of making
inferences  about  the  properties  of  objects.  Unlike  standard  techniques  that 
use  a  distance  metric  which  is  assumed  to  bear  some  relation  to  the  desired 
structure  of  the  categories,  the  uncertainty  measure  directly  evaluates  the 
utility  of  the  categories  in  terms  of  the  inferences  that  can  be  supported. 
By  directly  measuring  the  degree  to  which  a  categorization  supports  the 
performance  of  the  task  of  interest  (namely  that  of  making  inferences),  we 
are  more  likely  to  discover  a  useful  set  of  categories. 


7.3 The Utility of Natural Categories: Perception and Language


Throughout  this  thesis  we  have  motivated  the  categorization  problem  by  con¬ 
sidering  the  inference  requirements  of  the  observer.  However,  other  problems 
of  cognitive  science  are  made  less  severe  if  the  cognitive  system  can  recover 
the  structure  of  the  natural  world.  In  particular,  let  us  return  to  Quine’s 
question  of  natural  kinds  ([Quine,  1969]  and  section  2.3.1).  Quine  theorized 
that  intelligent  communication  between  individuals  is  possible  only  if  the 
individuals share a common description of the world. That is, the similarity
space (the qualia) of the individuals must be identical, or at least
approximately so. Without a common descriptive space, the individuals would
not  be  able  to  resolve  the  problem  of  reference:  determining  the  extension  in 
the  world  of  some  vocabulary  term  used  or  of  some  gesture  made  by  another 
individual.  In  light  of  this  requirement  the  ability  to  recover  the  natural 
structure  in  the  world  provides  a  basis  for  communication  between  individ¬ 
uals.  By  recovering  categories  that  correspond  to  natural  classes  —  classes 
defined  by  processes  in  the  world  —  different  observers  can  be  assured  of 
convergent world descriptions. If two observers are categorizing a population
of objects using identical categorization evaluation functions, and if the
categorization function is appropriate for recovering natural categories, then
the two observers are guaranteed to recover similar categories. Therefore,
these observers will be able to develop a common description of objects to
serve as a basis for a mutual vocabulary.


7.4 Recovering Natural Processes: Present
and Future Work

The  motivation  we  have  presented  for  this  work  centers  on  the  task  of  making 
inferences  about  objects.  In  particular  we  have  argued  that  the  observer  must 
be  able  to  make  inferences  about  unobservable  properties  of  objects  given 
only  sensory  information.  This  task  led  us  to  the  Principle  of  Natural  Modes 
and  to  the  task  of  recovering  natural  object  categories. 

But  making  inferences  about  objects  is  only  a  sub-goal  of  a  much  more 
general  perceptual  goal:  understanding  the  world.  The  purpose  of  our  per¬ 
ceptual  mechanisms  is  to  convey  information  about  the  world  that  is  impor¬ 
tant  to  our  survival.  One  implication  of  this  goal  is  that  we  can  improve 
upon  the  goal  of  recovering  the  natural  categories  in  the  world.  We  know 
that the natural modes are caused by the interaction of natural processes.
Therefore,  a  more  complete  understanding  of  the  world  is  achieved  if  we 
recover  (discover)  the  natural  processes  that  are  present  in  the  environment. 

The  last  section  of  chapter  6  —  the  chapter  concerning  worlds  with  multi¬ 
ple  natural  modes  —  demonstrated  a  technique  by  which  the  observer  could 
assign  the  different  features  to  the  different  process  levels  present  in  the  do¬ 
main.  In  the  case  of  the  bacteria,  certain  features  were  identified  as  being 
constrained  by  the  genus,  and  others  by  the  species.  This  capability  begins 
to  give  the  observer  an  understanding  of  the  natural  processes  responsible  for 
the natural modes. He not only acquires the modes themselves, but also
gains  the  knowledge  of  how  the  natural  processes  constrain  the  properties  of 
objects. 

One  of  the  potential  extensions  to  this  work  is  to  make  explicit  the  con¬ 
cept  of  natural  processes  and  attempt  to  recover  the  processes  directly.  We 
would  still  assume  that  classes  exist  in  the  world  —  natural  modes.  How¬ 
ever,  we  would  associate  a  generating  process  with  each  class,  responsible  for 
producing  all  the  objects  in  that  class.  Now,  we  change  the  categorization 
task  by  requiring  that  the  observer  propose  generating  processes  to  explain 
the  observed  objects. 

An  example:  Suppose  the  observer  has  seen  100  different  objects,  and  his 
task  is  to  propose  generating  processes  to  account  for  them.  As  before  he 
could  propose  a  single  category,  which  encompasses  all  objects.  This  would 
correspond to the universal Turing machine process capable of producing
all objects. Alternatively, the observer could propose 100 different generating
processes each capable of producing only one object. Just as with
the  hypothesizing  of  categories,  the  observer  will  want  to  propose  those  pro¬ 
cesses  which  correspond  to  the  natural  modes,  which  permit  him  to  make 
inferences  about  properties  of  objects.  In  this  case  however,  the  observer 
has  a  vocabulary  of  processes;  he  must  know  (or  somehow  learn)  about  the 
types  of  physical  processes  that  can  occur.  In  other  words,  he  is  measuring 
his  uncertainty  about  properties  of  the  object’s  generating  process  as  op¬ 
posed  to  uncertainty  about  the  properties  directly.  This  approach  has  much 
greater  power  than  a  simple  property  vector  scheme  because  the  categories 
are  formed  by  constraint  on  their  physical  processes  as  opposed  to  constraint 
on  particular  properties.  And,  in  the  real  world,  it  is  the  processes  that  are 
constrained. 

By  searching  for  processes  directly,  we  would  reduce  the  dependence  on 
the  property  representation.  This  is  a  desirable  goal  given  the  common  belief 
that  no  simple  set  of  properties  is  going  to  be  sufficient  for  recognition; 
the  general  failure  of  standard  pattern  recognition  techniques  supports  this 
opinion.  Thus  an  alternative  approach  is  necessary,  and  we  need  to  be  able 
to  incorporate  the  ideas  and  principles  developed  in  this  thesis  into  a  more 
general  framework. 
     (base (truncate))
     (color (light)))))

 'species-specification ; really paper birch
 :genus 'betula
 :species 'papyrifera
 :common-name 'birch
 :feature-choice-list
 '((length (2.0 3.0 3.0 4.0 4.0 5.0))
   (width (1.0 2.0 2.0 3.0))
   (flare (1.0))
   (lobes (1.0))
   (margin (doubly-serrate))
   (apex (acute))
   (base (rounded))
   (color (dark)))))

(make-instance 'species-specification
 :genus 'ulmus
 :species 'americana
 :common-name 'elm
 :feature-choice-list
 '((length (4.0 5.0 5.0 6.0))
   (width (2.0 3.0 3.0))
   (flare (0.0 -1.0))
   (lobes (1.0))
   (margin (doubly-serrate))
   (apex (acuminate))
   (base (rounded))
   (color (dark)))))

(defvar SOY-A-SPEC (make-instance 'soy-specification
 :common-name 'a
 :feature-choice-list
 '((time (3 4 5 6))
   (stand (0))
   (precip (2))
   (temp (1))
   (hail (0))
   (years (1 2 3))
   (damage (0 1))
   (severity (1 2))
   (treatment (0 1))
   (germ (0 1 2))
   (height (1))
   (cond (1))
   (lodging (0 1))
   (cankers (3))
   (color (0 1))
   (fruit (1))
   (decay (1))
   (mycelium (0))
   (intern (0))
   (sclerotia (0))
   (pod (0))
   (root (0)))))

(defvar SOY-B-SPEC (make-instance 'soy-specification
 :common-name 'b
 :feature-choice-list
 '((time (3 4 5 6))
   (stand (0))
   (precip (0))
   (temp (1 2))
   (hail (0 1))
   (years (0 1 2 3))
   (damage (2 3))
   (severity (1))
   (treatment (0 1))
   (germ (0 1 2))
   (height (1))
   (cond (1))
   (lodging (0 1))
   (cankers (0))
   (color (3))
   (fruit (0))
   (decay (0))
   (mycelium (0))
   (intern (2))
   (sclerotia (1))
   (pod (0))
   (root (0)))))

(defvar SOY-C-SPEC (make-instance 'soy-specification
 :common-name 'c
 :feature-choice-list
 '((time (0 0 2 2 3 4))
   (stand (1 1 1 0))
   (precip (2))
   (temp (0))
   (hail (0 0 0 1))
   (years (0 1 2 3))
   (damage (1))
   (severity (1 2))
   (treatment (0 1))
   (germ (1 2))
   (height (1))
   (cond (0))
   (lodging (0 0 0 1))
   (cankers (1))
   (color (1))
   (fruit (0))
   (decay (1))
   (mycelium (0 1))
   (intern (0))
   (sclerotia (0))
   (pod (3))
   (root (0)))))

(defvar SOY-D-SPEC (make-instance 'soy-specification
 :feature-choice-list
 '((time (0 1 2 3))
   (stand (1))
   (precip (2))
   (temp (0 1))
   (hail (0 0 0 1))
   (years (0 1 1 2 3 3))
   (damage (1))
   (severity (1 2))
   (treatment (0 1))
   (germ (0 1 2))
   (height (1))
   (cond (1))
   (lodging (0))
   (cankers (1 2))
   (color (2))
   (fruit (0))
   (decay (0 0 0 1))
   (mycelium (0))
   (intern (0))
   (sclerotia (0))
   (pod (3))
   (root (1)))))

;; Bacteria Specifications

(defvar B-BF-SPEC (make-instance 'bacteria-specification
 :genus 'bacteroides
 :species 'fragilis
 :common-name 'bf
 :feature-choice-list
 '((location (SI))
   (gram (NEG))
   (gr-pen (R))
   (gr-rif (S))
   (gr-kan (R))
   (dole (neg))
   (esculin (pos))
   (bile (a))
   (glc (la))
   ; (salicin (neg))
   ; (arab (neg))
   (rham (neg))
   (nf1 (1 2 3 4))
   (nf2 (1 2 3 4)))))

(defvar B-BT-SPEC (make-instance 'bacteria-specification
 :genus 'bacteroides
 :species 'thetaiotamicron
 :common-name 'bt
 :feature-choice-list
 '((location (SI))
   (gram (NEG))
   (gr-pen (R))
   (gr-rif (S))
   (gr-kan (R))
   (dole (pos))
   (esculin (pos))
   (bile (a))
   (glc (la))
   ; (salicin (neg pos))
   ; (arab (pos))
   (rham (pos))
   (nf1 (1 2 3 4))
   (nf2 (1 2 3 4)))))

(defvar B-BV-SPEC (make-instance 'bacteria-specification
 :genus 'bacteroides
 :species 'vulgatus
 :common-name 'bv
 :feature-choice-list
 '((location (SI))
   (gram (NEG))
   (gr-pen (R))
   (gr-rif (S))
   (gr-kan (R))
   (dole (neg))
   (esculin (neg))
   (bile (a))
   (glc (la))
   ; (salicin (neg pos))
   ; (arab (pos))
   (rham (pos))
   (nf1 (1 2 3 4))
   (nf2 (1 2 3 4)))))


Lambda Tracking


In  chapter  6  we  described  a  recursive  categorization  procedure  capable  of 
recovering  natural  categories  in  a  multiple  modal  world.  We  illustrated  the 
technique  using  the  example  of  anaerobic  bacteria;  both  the  genera  and 
species  levels  of  categories  were  recovered.  In  this  section  we  provide  an  al¬ 
ternative  mechanism  for  recovering  multiple  stable  structures  within  a  pop¬ 
ulation.  This  technique  has  the  desirable  property  of  providing  an  explicit 
measure  of  the  degree  of  structure  contained  within  each  separate  category. 

Recall that for λ near zero, the finest possible categorization (a categorization
in which each object is its own category) yields the lowest total
categorization uncertainty U. As λ is increased, coarser, less homogeneous
categories are preferred. When λ is close to 1.0 the best possible categorization
consists of only one category. Thus, we can design an agglomerative
clustering technique [Duda and Hart, 1973, also chapter 3] which forms new
categories by merging previous categories.
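
The dependence of the preferred categorization on λ can be illustrated with a small sketch. It assumes, purely for illustration, that U is a convex combination (1 - λ)·U_P + λ·U_C of a normalized property uncertainty and a normalized category uncertainty, and it uses the entropy of the category sizes as a simplified stand-in for category uncertainty; the measure defined in the body of the thesis may differ in form and normalization. The function names and the four-object toy population are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def property_uncertainty(categories):
    """Size-weighted mean over categories of the summed per-feature entropy."""
    n = sum(len(cat) for cat in categories)
    return sum(
        len(cat) / n * sum(entropy(Counter(col).values()) for col in zip(*cat))
        for cat in categories
    )

def total_uncertainty(categories, lam):
    """Assumed form: (1 - lam) * U_P + lam * U_C, each scaled to [0, 1]."""
    objects = [obj for cat in categories for obj in cat]
    max_up = property_uncertainty([objects]) or 1.0  # one all-inclusive category
    max_uc = log2(len(objects)) or 1.0               # every object on its own
    u_p = property_uncertainty(categories) / max_up
    u_c = entropy([len(cat) for cat in categories]) / max_uc  # simplified U_C
    return (1 - lam) * u_p + lam * u_c

# Toy population with two modes: the first two features are constrained
# by the mode, the third feature is unconstrained.
a, b, c, d = (0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)
CANDIDATES = {
    "finest": [[a], [b], [c], [d]],
    "natural": [[a, b], [c, d]],
    "coarsest": [[a, b, c, d]],
}

def best(lam):
    """Name of the candidate categorization with the lowest U at this lam."""
    return min(CANDIDATES, key=lambda name: total_uncertainty(CANDIDATES[name], lam))

for lam in (0.05, 0.5, 0.95):
    print(lam, best(lam))
```

Under these assumptions the finest categorization is preferred at λ = 0.05, the two-mode categorization at λ = 0.5, and the single category at λ = 0.95, which matches the qualitative behavior described above.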

The algorithm we use is identical to that introduced in chapter 5 except
that λ is no longer constant. We begin by categorizing a population of
objects with λ set to some low value. Such a setting causes categories to be
continually split, yielding a categorization of many highly similar categories.
Then, as new objects are observed, we slowly increase the value of λ. For each
value of λ, the algorithm is permitted to execute until a stable categorization
is achieved. As the value of λ increases, categories begin to merge. Finally, as
λ approaches 1.0, the categories are merged into a single category. Because
we can track the categorization as the value of λ changes, we refer to this
algorithm as λ-tracking.
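
A much simplified sketch of the tracking loop follows. It is not the hypothesis generation algorithm of chapter 5: as a stated assumption it starts from singleton categories, considers only pairwise merges, and accepts a merge greedily whenever it lowers an assumed uncertainty of the form (1 - λ)·U_P + λ·U_C, with the entropy of the category sizes standing in for category uncertainty. All names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2

def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def uncertainty(cats, lam):
    """Assumed measure: (1 - lam) * U_P + lam * U_C, each scaled to [0, 1]."""
    objs = [o for c in cats for o in c]

    def u_p(cs):
        # Size-weighted mean of summed per-feature entropy within each category.
        return sum(
            len(c) / len(objs) * sum(entropy(Counter(col).values()) for col in zip(*c))
            for c in cs
        )

    max_up = u_p([objs]) or 1.0
    max_uc = log2(len(objs)) or 1.0
    u_c = entropy([len(c) for c in cats])
    return (1 - lam) * u_p(cats) / max_up + lam * u_c / max_uc

def settle(cats, lam):
    """Greedily merge pairs while some merge lowers U; return the stable result."""
    cats = [list(c) for c in cats]
    while len(cats) > 1:
        current = uncertainty(cats, lam)
        best = None
        for i, j in combinations(range(len(cats)), 2):
            merged = [c for k, c in enumerate(cats) if k not in (i, j)]
            merged.append(cats[i] + cats[j])
            u = uncertainty(merged, lam)
            if u < current - 1e-12 and (best is None or u < best[0]):
                best = (u, merged)
        if best is None:  # no merge helps: the categorization is stable
            break
        cats = best[1]
    return cats

def lambda_track(objects, lambdas):
    """Record the number of stable categories at each increasing value of lam."""
    cats = [[o] for o in objects]
    history = []
    for lam in lambdas:
        cats = settle(cats, lam)
        history.append((lam, len(cats)))
    return history

# Toy population: two modes, with the third feature unconstrained.
objects = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
print(lambda_track(objects, [0.05, 0.5, 0.95]))
```

On this toy population the sketch keeps four singleton categories at λ = 0.05, settles on the two modes at λ = 0.5, and collapses to a single category at λ = 0.95.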

To illustrate the behavior of the algorithm, we will use the soybean diseases
introduced in chapter 5 and re-presented in Table B.1. A population of
20 examples of each species of disease is generated. Because the value of λ
changes over time, this technique can only be applied to a fixed population;
when the algorithm terminates at λ equal to 1.0, there is only one category
and the introduction of a new object would be meaningless.
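
One plausible way to generate such a population is sketched below. It assumes that each example is produced by drawing every feature independently and uniformly from that feature's list of allowed values; under that assumption a value repeated in a choice list, such as (0 0 0 1), would simply be drawn more often. The sampling scheme, the function name, and the five-feature excerpt (values for diseases A and B taken from Table B.1) are illustrative assumptions rather than the procedure actually used.

```python
import random

# Excerpt of Table B.1 for diseases A and B (five of the features only).
SPECS = {
    "A": {"time": [3, 4, 5, 6], "precip": [2], "temp": [1],
          "damage": [0, 1], "cankers": [3]},
    "B": {"time": [3, 4, 5, 6], "precip": [0], "temp": [1, 2],
          "damage": [2, 3], "cankers": [0]},
}

def generate_population(specs, per_class, seed=0):
    """Draw per_class examples of each class, one uniform choice per feature."""
    rng = random.Random(seed)  # seeded so that a run can be repeated
    population = []
    for label, spec in specs.items():
        for _ in range(per_class):
            example = {feature: rng.choice(choices)
                       for feature, choices in spec.items()}
            population.append((label, example))
    return population

population = generate_population(SPECS, per_class=20)
print(len(population), population[0])
```

The class label is kept alongside each example only so that the recovered categories can be compared with the known diseases afterwards; a categorization procedure would see the feature values alone.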

Normally, agglomerative techniques produce a dendrogram, a diagram
reflecting the dynamic change in category structure as the distance between
categories is increased (as defined by some metric). For the technique described
here, we will display the results of the execution in the form of a
λ-space diagram (Figure B.1). For each value of λ a qualitative description
of the categorization produced is illustrated. For example, at λ of .55, the
categorization produced consists of three categories corresponding to diseases
A, B, and C, and several smaller categories each containing members
of disease D.

We begin with a λ of .35. This value of λ was found to be sufficiently
low to cause the categorization procedure to continually split previous categories.
When λ increases to .40, the separate categories containing members
of disease B coalesce to form a category corresponding to that disease. Notice
that this category remains until λ is raised to a value of .95. We refer to
this duration as the λ-stability of the B category. This version of λ-stability
differs from that presented in chapter 6, which referred to the stability of
the categorization as a whole. In this case, λ-stability permits us to consider
the stability of each category individually. In Figure B.1 the four natural categories
corresponding to the four diseases display a (relatively) high degree
of λ-stability, indicating that these categories correspond to natural structure
in the population.
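
Per-category λ-stability can be read off a record of the tracking run. The sketch below assumes the run is stored as a list of (λ, categories) pairs in increasing order of λ, with each category represented as a frozenset of object identifiers, and it measures stability as the distance between the first and last sampled λ at which a category is present. The trace shown is hypothetical and is not the data of Figure B.1.

```python
def lambda_stability(trace):
    """Return {category: last_lam - first_lam} over a sampled tracking run.

    Assumes a category, once formed, persists without interruption until
    it is merged into a larger one.
    """
    spans = {}
    for lam, categories in trace:
        for cat in categories:
            first, _ = spans.get(cat, (lam, lam))
            spans[cat] = (first, lam)
    # Rounding hides floating-point noise in differences such as 0.6 - 0.4.
    return {cat: round(last - first, 10) for cat, (first, last) in spans.items()}

def fs(*ids):
    """Shorthand for building a category from object identifiers."""
    return frozenset(ids)

# Hypothetical trace over four objects, two from each of two diseases.
trace = [
    (0.35, [fs("a1"), fs("a2"), fs("b1"), fs("b2")]),
    (0.40, [fs("a1"), fs("a2"), fs("b1", "b2")]),
    (0.60, [fs("a1", "a2"), fs("b1", "b2")]),
    (0.95, [fs("a1", "a2", "b1", "b2")]),
]

stability = lambda_stability(trace)
print(stability[fs("b1", "b2")])
```

Because the trace is sampled at discrete values of λ, the result is only a lower bound on the true range over which a category persists; a finer sampling of λ tightens it.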

Notice, however, that the category {C, D} exhibits the same degree of
λ-stability as the D category. Such stability may indicate that there exists a
common structure shared by these two diseases that qualifies them as being
a natural mode. However, without independent verification from botanists
we cannot confirm this hypothesis.

Experimental evaluation indicates that the λ-tracking algorithm is not
as powerful a technique for recovering multiple modal levels as is recursive
categorization. One explanation for this result is that the algorithm always
considers the entire population as a whole, without limiting its attention to

             A              B              C              D
             Diaporthe      Charcoal       Rhizoctonia    Phytophthora
             Stem Canker    Rot            Root Rot       Rot

time         {3,4,5,6}      {3,4,5,6}      {0,2,3,4}      {0,1,2,3}
stand        0              0              {1,0}          1
precip       2              0              2              2
temp         1              {1,2}          0              {0,1}
hail         0              {0,1}          {0,1}          0
years        {1,2,3}        {0,1,2,3}      {0,1,2,3}      {0,1,2,3}
damage       {0,1}          {2,3}          1              1
severity     {1,2}          1              {1,2}          {1,2}
treatment    {0,1}          {0,1}          {0,1}          {0,1}
germ         {0,1,2}        {0,1,2}        {1,2}          {0,1,2}
height       1              1              1              1
cond         1              1              0              1
lodging      {0,1}          {0,1}          0              0
cankers      3              0              1              {1,2}
color        {0,1}          3              1              2
fruit        1              0              0              0
decay        1              0              1              {0,1}
mycelium     0              0              {0,1}          0
intern       0              2              0              0
sclerotia    0              1              0              0
pod          0              0              3              3
root         0              0              0              1

Table B.1: The property specifications for four species of soybean plant diseases.
These data are derived from Stepp [1985].

Figure B.1: The λ-space diagram produced by executing the λ-tracking algorithm
on a population of soybean diseases. For each λ, a qualitative description of the
categorization is illustrated.


finding modes within one particular category. Thus, only small ranges of λ
are available for each stable categorization. For example, if there are four
stable categorizations, then their λ-stability ranges must share the unit
interval and so can average at most .25 each. Thus, although λ-tracking allows
the assessment of the degree of structure present in each category, it is not
a robust mechanism for recovering multiple modal categorizations.


